arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (4 papers)

2606.10365 2026-06-10 cs.SD New submission

KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

Jin Li, Wenbin Jiang, Ji Hu

Affiliations School of Electronics and Information Engineering, Hangzhou Dianzi University; School of Communication Engineering, Hangzhou Dianzi University

AI summary Proposes KFC-KWS, a multimodal framework that uses CTC-guided keyframe selection to align audio, phoneme, and text modalities and fuses the keyframes with full-utterance representations via cross-attention; on LibriPhrase it reaches 98.73% AUC overall, and 97.65% AUC with 7.75% EER on the hard subset, effectively separating confusable keywords.

Comments Accepted by Interspeech 2026

Abstract

User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
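
The keyframe idea in the abstract can be illustrated with a minimal sketch. It assumes per-frame CTC posteriors are already available and keeps, for each run of the same non-blank label, the single most confident frame. The function name, the blank index, and the 0.5 threshold are hypothetical choices for illustration, not details taken from the paper.

```python
def select_keyframes(posteriors, blank=0, threshold=0.5):
    """Return (frame_index, label) pairs for confident non-blank CTC peaks.

    posteriors: list of per-frame probability lists.
    blank, threshold: illustrative defaults, not values from the paper.
    """
    keyframes = []
    best = None  # (frame_index, label, prob) of the current non-blank run
    for t, frame in enumerate(posteriors):
        label = max(range(len(frame)), key=frame.__getitem__)
        prob = frame[label]
        if best is not None and label != best[1]:
            # The run ended: keep its peak frame if it is confident enough.
            if best[2] >= threshold:
                keyframes.append(best[:2])
            best = None
        if label != blank and (best is None or prob > best[2]):
            best = (t, label, prob)
    if best is not None and best[2] >= threshold:
        keyframes.append(best[:2])
    return keyframes


posteriors = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.85, 0.05],  # label 1, peak of its run
    [0.20, 0.75, 0.05],  # label 1 again, weaker
    [0.95, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.80],  # label 2
]
keyframes = select_keyframes(posteriors)
```

Because CTC posteriors tend to be peaky, most frames fall on the blank label and only a handful of frames survive this filter, which is what makes them usable as alignment anchors.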

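2606.10439 2026-06-10 cs.SD cs.CL eess.AS New submission

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

Affiliations University of Science and Technology of China

AI summary Proposes a projector-based LLM-ASR framework that uses a Mixture of Experts architecture to improve cross-lingual adaptability and a Continuous Integrate-and-Fire mechanism for dynamic downsampling and modality alignment; experiments show that it substantially outperforms strong baseline models.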

Comments Accepted by ICASSP 2026

Journal ref ICASSP (2026), 18807-18811

Abstract

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
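
As a rough illustration of how Continuous Integrate-and-Fire performs dynamic downsampling, the sketch below accumulates per-frame weights and emits one integrated vector each time the running total reaches a threshold. It follows the commonly described CIF recipe; how this particular paper predicts the weights or sets the threshold is not stated in the abstract, so the threshold of 1.0 and the toy inputs are assumptions.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate weighted frames and fire one output per unit of weight.

    frames: list of feature vectors (lists of floats).
    alphas: per-frame weights, each assumed to lie in [0, threshold].
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_vec = [a + alpha * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: spend just enough of this frame to reach the threshold,
            # then carry the remainder into the next accumulation.
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    return outputs


frames = [[1.0], [2.0], [3.0], [4.0]]
alphas = [0.5, 0.5, 0.25, 0.75]
outputs = cif(frames, alphas)  # four input frames become two output vectors
```

The number of outputs tracks the sum of the weights rather than the number of frames, which is why the mechanism can shorten a speech sequence toward text length before it reaches the LLM.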

2606.10454 2026-06-10 eess.AS cs.SD Cross-list

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

Mohan Shi, Kaiyuan Zhang, Zilai Wang, Natarajan Balaji Shankar, Eray Eren, Abeer Alwan

Affiliations University of California, Los Angeles, USA

AI summary Proposes a Mixture-of-Experts Speech-LLM that combines classifier-based domain routing, Mixture-of-Projectors and Mixture-of-LoRAs modules, and an entropy-aware routing mechanism for unified child-adult ASR across environments and age groups, with consistent improvements on public child corpora.

Comments Accepted to Interspeech 2026

Abstract

While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
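
One way to picture entropy-aware routing is sketched below: the router's domain probabilities pick a top expert, and a shared expert is blended in only when the normalized entropy of those probabilities is high. The abstract does not give the actual rule, so the threshold, the use of normalized entropy as the blending weight, and both function names are assumptions made for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of probs divided by its maximum, log(len(probs))."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def route(probs, entropy_threshold=0.5):
    """Return (top_expert_index, shared_expert_weight).

    The shared expert gets a non-zero weight only when the router is
    uncertain. Threshold and weighting rule are hypothetical.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    h = normalized_entropy(probs)
    shared_weight = h if h > entropy_threshold else 0.0
    return top, shared_weight


confident = route([0.98, 0.01, 0.01])  # clear domain: no shared expert
uncertain = route([0.40, 0.35, 0.25])  # near a boundary: shared expert blended
```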

2508.07048 2026-06-10 cs.SD cs.AI cs.LG eess.AS Replacement

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Jongchan Kim, Hyungon Ryu, Hyuk-Jae Lee, Nam-Joon Kim

Affiliations Seoul National University; Soongsil University; NVIDIA Corporation

AI summary Proposes Whisfusion, which trains a dedicated masked diffusion decoder on frozen Whisper audio embeddings and performs non-autoregressive ASR via parallel diffusion decoding, surpassing Whisper-large-v3 on multilingual benchmarks while running 4-5x faster.

Comments 16 pages, 3 figures

Abstract

Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. CTC-style non-autoregressive (NAR) systems are a natural alternative that avoids this bottleneck, but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks while running 4-5x faster, and it also surpasses Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.
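
The general shape of few-step parallel decoding from a fully masked start can be sketched as a generic mask-predict loop. This is not the paper's Parallel Diffusion Decoding procedure, whose details are not given in the abstract; the unmasking schedule, the `predict` interface, and the toy predictor are all assumptions.

```python
import math

MASK = None  # placeholder for a still-masked position


def parallel_unmask(length, predict, steps):
    """Fill `length` masked positions in at most `steps` parallel rounds.

    predict(tokens) must return one (token, confidence) pair per position.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        predictions = predict(tokens)
        # Commit an equal share of the remaining positions each round,
        # choosing the positions the predictor is most confident about.
        quota = math.ceil(len(masked) / (steps - step))
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = predictions[i][0]
    return tokens


# Toy stand-in for a trained decoder: fixed tokens and confidences.
target = ["a", "b", "c", "d"]
confidence = [0.9, 0.2, 0.8, 0.4]
calls = []


def toy_predict(tokens):
    calls.append(list(tokens))
    return list(zip(target, confidence))


decoded = parallel_unmask(len(target), toy_predict, steps=2)
```

The point of the sketch is the cost model: the number of decoder calls is set by `steps`, not by the transcript length, which is where the latency advantage over left-to-right decoding comes from.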

2. Speech Synthesis and Sound Generation (5 papers)

2606.10591 2026-06-10 cs.SD New submission

ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding

Chengbin Liang, Wenqi Guo, Hao Cao, Zhijin Qin

Affiliations Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University

AI summary Proposes ContextCodec, which decouples acoustic details from content context with a dual-branch encoder and aligns context features with phoneme indices through a CLIP-style contrastive loss, achieving a good balance of quality and intelligibility at 500 bps.

Comments Accepted at Interspeech 2026. 6 pages, 2 figures, 5 tables

Abstract

Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (< 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with an RTF of 0.4886 on a typical mobile CPU.

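2606.09962 2026-06-10 cs.LG cs.AI cs.SD Cross-list

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

Vadim Popov, Wenju Gu, Tasnima Sadekova, Georgii Aparin, Assel Yermekova

AI summary Studies the latent-space structure of discrete tokens in continuous diffusion models, shows through theoretical analysis and experiments that the FSQ tokenization scheme is best suited to continuous diffusion for categorical data, and validates on text-to-speech that it outperforms an LLM-based approach.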

Abstract

Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has been growing steadily as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study properties of the latent-space structure corresponding to discrete tokens, expressed in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy of correct-token prediction by the optimally trained diffusion model. We find that the FSQ tokenization scheme has a latent-space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models that use speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.
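
For readers unfamiliar with FSQ (finite scalar quantization), the sketch below shows the basic idea as it is usually described: each latent dimension is squashed into a bounded range and rounded to a small fixed set of integer levels, so the tokens sit on a regular grid. It handles only odd level counts to keep the rounding symmetric, and the level configuration is an arbitrary example rather than one used in the paper.

```python
import math


def fsq_quantize(z, levels):
    """Round each bounded latent dimension to one of `levels[d]` integers.

    Only odd level counts are supported in this sketch, so the codes for a
    dimension with L levels are -(L-1)/2, ..., 0, ..., (L-1)/2.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        codes.append(round(half * math.tanh(value)))
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index


levels = [5, 5, 3]  # implicit codebook of 5 * 5 * 3 = 75 tokens
codes = fsq_quantize([0.1, -2.0, 3.0], levels)
token_id = fsq_index(codes, levels)
```

Unlike a learned vector-quantization codebook, neighbouring token ids here correspond to neighbouring grid points in latent space, which is the kind of latent structure the abstract refers to.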

2606.10317 2026-06-10 eess.AS cs.SD Cross-list

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu

Affiliations The University of Tokyo, Japan

AI summary Proposes SSL-GMMVC, which models source-target features with a Gaussian mixture model in self-supervised speech space and performs interpretable voice conversion through posterior-weighted affine transforms, improving speaker similarity while maintaining intelligibility and naturalness.

Comments Accepted to Interspeech 2026

Abstract

We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
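
The phrase "posterior-weighted sum of affine transforms" can be made concrete with a one-dimensional sketch: each mixture component owns an affine map, and the converted value is the average of those maps weighted by the component posteriors given the source feature. The component parameters below are invented for illustration; in a real system they would be estimated from paired source-target features in a much higher-dimensional space.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, components):
    """Return (converted_value, posteriors) for a scalar source feature x.

    Each component is a dict with keys weight, mean, var (source-side GMM)
    and scale, offset (that component's affine transform).
    """
    likelihoods = [c["weight"] * gaussian_pdf(x, c["mean"], c["var"]) for c in components]
    total = sum(likelihoods)
    posteriors = [value / total for value in likelihoods]
    y = sum(p * (c["scale"] * x + c["offset"]) for p, c in zip(posteriors, components))
    return y, posteriors


# Two illustrative components with different local transforms.
components = [
    {"weight": 0.5, "mean": -5.0, "var": 1.0, "scale": 2.0, "offset": 0.0},
    {"weight": 0.5, "mean": 5.0, "var": 1.0, "scale": 1.0, "offset": 10.0},
]
y_left, _ = convert(-5.0, components)   # dominated by the first transform
y_mid, post_mid = convert(0.0, components)  # equal blend of both transforms
```

Because every piece is a Gaussian or an affine map, the learned scales and offsets can be inspected directly, which is the interpretability argument the abstract makes.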

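2412.11449 2026-06-10 cs.SD cs.AI cs.CL cs.LG eess.AS Replacement

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Prateek Verma

Affiliations Stanford University

AI summary Proposes Whisper-GPT, a generative large language model that combines continuous audio representations (such as spectrograms) with discrete audio tokens, addressing the long context lengths of discrete-token approaches and lowering perplexity and negative log-likelihood for next-token prediction on speech and music.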

Comments 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India

Abstract

We propose WHISPER-GPT, a generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g., ENCODEC. However, one of the major drawbacks of this approach is handling the context length, which blows up for high-fidelity generative architectures if one has to account for all the audio content at various frequencies for next-token prediction. By combining continuous audio representations such as the spectrogram with discrete acoustic tokens, we retain the best of both worlds: we have all the information needed from the audio at a specific time instance in a single token, yet allow the LLM to predict the future token, which permits sampling and the other benefits that a discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for next-token prediction compared to a token-based LLM for speech and music.

2606.09141 2026-06-10 eess.AS cs.SD Replacement

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

Hanke Xie, Xiaming Ren, Dake Guo, Ruonan You, Wenhao Li, Jingbin Hu, Guobin Ma, Huakang Chen, Kejie Xu, Rui Huang, Weiguo Tan, Xianrong Wang, Lei Xie

Affiliations Huawei Technologies Co., Ltd

AI summary Proposes the FlashTTS framework, which achieves low-latency streaming TTS through a lagged multi-track architecture, parallel multi-token prediction, and an X-pred mean flow matching decoder, reducing first-packet latency to 325 ms while preserving zero-shot voice cloning and cross-lingual intelligibility.

Comments Accepted to Interspeech 2026

Abstract

Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.

3. Speaker Recognition, Verification, and Separation (1 paper)

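2606.10565 2026-06-10 cs.SD eess.AS New submission

A Lightweight Dual-Factor Acoustic Authentication System via Cascaded GMM-DTW Architecture for Edge Computing

Yutong Zhang

Affiliations Yutong Zhang

AI summary For resource-constrained edge environments, proposes a lightweight cascaded GMM-DTW dual-factor voice lock system that implements sequential defense over a shared MFCC feature space and adds a dynamic joint absolute-relative margin constraint, achieving low latency and high security on low-power edge nodes.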

Abstract

This paper presents a lightweight, cascaded GMM-DTW dual-factor voice lock system for resource-constrained edge environments. By utilizing a shared MFCC feature space, the framework implements a sequential defense mechanism combining GMM speaker screening and DTW passphrase verification. To counter presentation threats without extra hardware, a dynamic joint absolute-relative margin constraint is integrated into the GMM classification space, limiting the physical imposter and high-fidelity replay attack False Acceptance Rates (FAR) to 2.73% and 6.67%, respectively, with a legitimate False Rejection Rate (FRR) of 16.67%. Due to Sakoe-Chiba window optimization, the global end-to-end processing latency under temporal stress is rigidly bounded at 9.82ms on a single-core CPU, comprising 1.51ms for feature extraction, 0.54ms for GMM scoring, and 7.77ms for worst-case DTW matching. These empirical benchmarks demonstrate the viability of white-box acoustic cascades for secure, deterministic real-time deployment on low-power edge nodes.
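
The role of the Sakoe-Chiba window in bounding DTW cost can be seen in a small sketch: only cells within a fixed band around the diagonal are filled, so the work per comparison is bounded by the band width rather than the full grid. The scalar sequences and window sizes below are toy values; the paper operates on MFCC features, and its actual window setting is not given in the abstract.

```python
def dtw_band(a, b, window):
    """DTW distance between scalar sequences, restricted to a Sakoe-Chiba band.

    Cells with |i - j| > window are never visited. The window is widened to
    at least the length difference so that an alignment always exists.
    """
    n, m = len(a), len(b)
    window = max(window, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


same_shape = dtw_band([1, 2, 3, 4], [1, 1, 2, 3, 4], window=1)  # stretched copy
different = dtw_band([0, 0], [1, 1], window=0)                  # diagonal only

# Per-stage latencies quoted in the abstract (ms) and their sum.
total_ms = 1.51 + 0.54 + 7.77
```

The stage latencies quoted in the abstract do add up to the reported 9.82 ms end-to-end bound, with the banded DTW stage accounting for most of it.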

4. Speech Enhancement, Denoising, and Audio Restoration (1 paper)

2606.10233 2026-06-10 eess.AS cs.LG cs.SD Cross-list

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

Zhuoyan Tao, Jiatong Shi, Hye-jin Shim, Shinji Watanabe

Affiliations University of Southern California, USA; Carnegie Mellon University, USA

AI summary Proposes ANCHOR, which recasts incremental speech quality assessment as a multi-resolution autoregressive task, uses dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement from chunk to utterance level, substantially reduces error on partial input, and sheds light on how perceptual quality accumulates over time.

Comments Accepted at Interspeech 2026

Abstract

While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.

5. Audio Event Detection and Scene Understanding (2 papers)

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM New submission

Time-frequency localization of bird calls in dense soundscapes

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

Affiliations Acoustic Research Laboratory, National University of Singapore; Tropical Marine Science Institute, National University of Singapore; School of Marine Science and Technology, Northwestern Polytechnical University

AI summary Treats bird call detection as an object detection task on spectrograms, trains YOLO11 models to localize bird calls in dense tropical soundscapes, and introduces the IoMin evaluation metric, outperforming the baseline on both in-distribution and out-of-distribution data.

Abstract

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
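
The difference between IoU and IoMin can be sketched for time-frequency boxes. The abstract only names the metric, so the definition used here (intersection area divided by the area of the smaller box) is an assumption based on that name, and the box layout `(t_start, f_low, t_end, f_high)` is a hypothetical convention.

```python
def box_area(box):
    """Area of a (t_start, f_low, t_end, f_high) box; zero if degenerate."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection_area(a, b):
    """Overlap area of two time-frequency boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def iou(a, b):
    """Intersection over union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Intersection over the area of the smaller box (assumed definition)."""
    smallest = min(box_area(a), box_area(b))
    return intersection_area(a, b) / smallest if smallest > 0 else 0.0


annotation = (0.0, 1000.0, 2.0, 3000.0)  # generous hand-drawn box (s, Hz)
prediction = (0.5, 1500.0, 1.5, 2500.0)  # tighter box fully inside it
scores = (iou(annotation, prediction), iomin(annotation, prediction))
```

Under this reading, a tight prediction nested inside a loosely drawn annotation scores 0.25 on IoU but 1.0 on IoMin, which illustrates why such a metric is more forgiving when call boundaries are ambiguous.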

2511.05349 2026-06-10 cs.SD Replacement

Passive Acoustic-based Composite Indices for Reef Health Monitoring in Noisy Tropical waters

Hari Vishnu, Yuen Min Too, Mandar Chitre, Danwei Huang, Teong Beng Koay, Sudhanshi S. Jain

Affiliations University of Technology, Sydney; Nanyang Technological University; National Institute of Oceanography and Environmental Physics; Institute of Marine and Coastal Sciences, University of Connecticut; Indian Institute of Technology, Bombay

AI summary Proposes using a convolutional neural network denoiser to handle low-frequency noise and combines acoustic indicators such as sound pressure level, acoustic complexity index, and shrimp snap rate to monitor reef health in agreement with diver-based assessments.

Abstract

Passive acoustic monitoring offers the potential to enable long-term, spatially extensive assessments of coral reefs. To explore this approach, we deployed underwater acoustic recorders at ten coral reef sites around Singapore waters over two years. To mitigate the persistent anthropogenic and current-induced noise masking the low-frequency reef soundscape, we trained a convolutional neural network denoiser. Analysis of the acoustic data reveals distinct morning and evening choruses. Though the correlation with environmental variates was obscured in the low-frequency part of the noisy recordings, the denoised data showed correlations of acoustic activity indices such as sound pressure level and acoustic complexity index with diver-based assessments of reef health such as live coral richness and cover, and algal cover. Furthermore, the shrimp snap rate, computed from the high-frequency acoustic band, is robustly correlated with the reef parameters, both temporally and spatially. This study demonstrates that passive acoustics holds valuable information that can help with reef monitoring, provided the data is effectively denoised and interpreted. This methodology can be extended to other marine environments where acoustic monitoring is hindered by persistent noise.
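
Of the indices mentioned, sound pressure level is the simplest to write down. The sketch below computes broadband SPL from calibrated pressure samples using the standard definition, with the usual underwater reference pressure of 1 µPa. It assumes the samples are already in pascals; the study's calibration, band limits, and averaging windows are not described in the abstract.

```python
import math


def sound_pressure_level_db(samples_pa, p_ref_pa=1e-6):
    """Broadband SPL in dB re p_ref from pressure samples given in pascals.

    The default reference of 1 micropascal is the underwater convention.
    """
    rms = math.sqrt(sum(s * s for s in samples_pa) / len(samples_pa))
    return 20.0 * math.log10(rms / p_ref_pa)


# A toy signal with 1 Pa RMS corresponds to 120 dB re 1 uPa.
spl = sound_pressure_level_db([1.0, -1.0, 1.0, -1.0])
```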

6. Music Information Retrieval and Music Generation (2 papers)

2606.10627 2026-06-10 cs.HC cs.LG cs.SD Cross-list

Profy: Interpretable Visualization of Expertise-Dependent Motor Skills Toward Supporting Piano Practice

Kazuki Kawamura, Fujiki Nakamura, Hayato Nishioka, Momoko Shioki, Shinichi Furuya, Jun Rekimoto

Affiliations The University of Tokyo; Sony Computer Science Laboratories; NeuroPiano Institute

AI summary Proposes Profy, a weakly supervised system that learns time-aligned highlights from listener-rating labels to help piano learners locate passages that need focused practice, agreeing closely with expert annotations despite training without localized labels.

Comments Designing Interactive Systems Conference (DIS '26), June 13-17, 2026, Singapore, Singapore

Abstract

The quality of piano performance depends on nuanced timing, articulation, and dynamic control, but practice feedback is often summary-based and hard to act on. We introduce Profy, a weakly supervised system that learns from take-level labels derived from aggregated listener ratings (expert-labeled vs. amateur-labeled) to produce time-aligned highlights for review during piano practice. We collected synchronized 1 kHz key-motion and audio from 73 pianists and used 1,083 valid takes for modeling and evaluation. The model outputs clip-level predictions together with evidence scores on a shared resampled model time base for visualization. On 20 amateur clips from short technique studies annotated by 21 expert pianists, the displayed highlight score aligns with passages that expert pianists marked for review despite training without localized labels (Pearson r=0.61, ROC-AUC 0.75). Rather than summarizing a take with a single global score, Profy helps learners decide where to inspect next by supporting scrubbing, looping, and focused replay of time-localized passages associated with expert-amateur differences.

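2606.03803 2026-06-10 cs.SD cs.AI eess.AS Replacement

LiveBand: Live Accompaniment Generation in the Audio Domain

Marco Pasini, Javier Nistal, Ben Hayes, Mathias Rose Bjare, Stefan Lattner, George Fazekas

Affiliations University of Cambridge

AI summary Proposes LiveBand, a system that uses a causal Transformer to generate high-fidelity accompaniment in the continuous latent space of a pre-trained causal audio autoencoder, trained with adversarial sequence-level supervision, enabling real-time streaming generation.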

Abstract

We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.

7. Speech Translation and Speech Language Models (2 papers)

2606.10368 2026-06-10 cs.SD cs.AI New submission

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

Affiliations Tianjin University; Shanghai Jiao Tong University; Nankai University

AI summary Proposes ELF-S2T, an audio-conditioned continuous-target generative model built on a pre-trained ELF backbone; with audio forcing during training and classifier-free guidance at inference, it achieves competitive ASR and S2TT performance on LibriSpeech and CoVoST2, and its error analysis shows that recognition and translation errors both stem from close-distance confusion in the continuous latent space.

Abstract

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
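
The way classifier-free guidance amplifies a condition can be shown with the standard combination rule: the model is evaluated with and without the audio condition, and the difference between the two predictions is scaled up. How ELF-S2T applies this inside its flow-matching sampler, and which guidance scale it uses, is not stated in the abstract, so the function name and the numbers below are purely illustrative.

```python
def cfg_combine(conditional, unconditional, guidance_scale):
    """Standard classifier-free guidance on two prediction vectors.

    guidance_scale = 1.0 returns the conditional prediction unchanged;
    larger values push further in the direction the condition suggests.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(conditional, unconditional)
    ]


with_audio = [1.0, 2.0]     # toy prediction given the audio condition
without_audio = [0.0, 2.0]  # toy prediction with the condition dropped
guided = cfg_combine(with_audio, without_audio, guidance_scale=2.0)
```

Only the first component changes here, because it is the only one on which the conditional and unconditional predictions disagree; that is the sense in which guidance strengthens the influence of the audio rather than of the pre-trained text prior.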

2606.10581 2026-06-10 cs.CL cs.SD eess.AS 交叉投稿

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge: 弥合语音语言模型中的副语言感知与对话行为

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tencent Hunyuan(腾讯混元) Shenzhen Loop Area Institute(深圳循环区域研究所) Amphion Technology Co., Ltd.(Amphion科技有限公司) Tsinghua University(清华大学)

AI总结 提出ParaBridge,一种同策略(on-policy)自我蒸馏方法,将推理阶段的副语言指令支架转化为稳定的模型行为,无需人工标注或外部奖励,显著提升语音语言模型对副语言线索的响应能力。

详情
AI中文摘要

语音携带的信息远不止文字:孩子的声音、恐惧的语气或嘈杂的背景都应引导一个足够胜任的语音对话助手给出不同的回复。当前的语音语言模型(SLM)能够识别此类副语言线索,但在开放域对话中常常忽略它们。我们观察到,在推理阶段使用简单的副语言指令支架可以缩小这种感知-行为差距,表明相关线索已潜在于模型中。然而,这种支架在多轮上下文和竞争指令下仍然脆弱。因此,我们提出ParaBridge,一种同策略(on-policy)自我蒸馏方法,将脆弱的推理时支架转化为稳定的模型行为。在训练过程中,支架仅作为临时的特权视图;无支架模型自行生成回复,而支架视图沿其轨迹提供密集的全词汇下一词目标。这种监督教会模型判断非词汇线索何时应影响回复,无需精心整理的对话、人工标签或外部奖励模型。在Qwen3-Omni-thinking上,ParaBridge将无支架的VoxSafeBench SAR从14.6%提升至40.3%,并将EchoMind平均评分从3.27提升至3.92。它还保留了通用能力,MMAU-Pro、VoiceBench和GPQA均与原始模型相差在0.4分以内。在训练分布之外,ParaBridge泛化到未见过的副语言线索,从面向安全的训练迁移到共情导向的对话,并在另一个SLM骨干上同样有效。

英文摘要

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.

8. 多模态音频与视听学习 3 篇

2606.09966 2026-06-10 cs.SD 新提交

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

RespiraMFM:一种用于呼吸道疾病识别的对比音频-语言对齐多模态基础模型

Shakhrul Iman Siam, Tiantian Feng, Jiankun Zhang, Shrikanth Narayanan, Mi Zhang

AI总结 提出RespiraMFM多模态基础模型,通过对比音频-文本对齐策略整合呼吸音与临床信息,在监督和零样本任务中分别提升AUROC 9.15%和20.98%。

Comments ACL 2026 Main Conference

详情
AI中文摘要

呼吸道疾病仍然是全球死亡率的主要原因,及时准确的诊断对于改善患者预后和减轻医疗负担至关重要。虽然先前的工作已经探索了基于音频的呼吸道疾病检测模型,但这种单模态方法通常泛化能力和诊断精度有限。在本文中,我们提出了RespiraMFM,一种多模态基础模型,它将呼吸音与患者病史和症状相结合,以提高诊断准确性和疾病检测能力。我们引入了一种有效的音频-文本多模态整合对比对齐策略,使模型能够学习呼吸音与相应文本临床信息之间更好的跨模态表示。我们使用七个真实世界数据集,在监督微调和零样本设置下,对五种主要呼吸道疾病评估了RespiraMFM,在监督任务中AUROC提高了9.15%,在零样本任务中比现有基线提高了20.98%。这些发现强调了我们的框架在推进呼吸道疾病管理中早期诊断和改善临床决策方面的潜力。

英文摘要

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD 交叉投稿

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

从感知到决策:多模态大语言模型中听觉与视觉感知的信息流

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

AI总结 研究多模态大语言模型(AVLLMs)中音频和视觉信息流的路径与整合机制,发现顺序流与并行流两种路由模式,并证明信息传递后可丢弃无关token以提升效率。

Comments 40 pages, 29 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)能够听和看,但音频和视觉信号实际上如何通过网络传播以形成答案?尽管它们在研究和实际应用中的作用日益增长,但音频和视觉标记影响最终预测的内部路径仍然知之甚少。在本研究中,我们考察了音频-视觉大语言模型(AVLLMs)内部的音视频信息流,追踪了AVLLMs如何在两种输入配置(视听视频和多个交错音视频项目)下路由、利用和整合音频与视觉信息。我们发现,对于视听视频,AVLLMs遵循为VLMs和VideoLLMs建立的顺序信息流路径,音频和视觉贡献沿着该路径按任务对每种模态的依赖程度成比例流动。在多个交错音视频项目的设置中,这种路由转变为不同的并行流。此外,我们证明,一旦音频-视觉和其他类型的标记的信息被传递到LLM,它们可以被丢弃,对模型的预测影响最小甚至略有改善,这适用于多个任务和数据集,从而实现更高效的推理。这些发现适用于多个模型和规模,包括3B和7B规模的Qwen2.5-Omni和Video-SALMONN2 Plus,从而产生了关于这些流结构为何出现的假设。总之,这些结果首次清晰地描绘了AVLLMs如何在网络内部协调声音和视觉,并为音频-视觉及更广泛的MLLMs在可解释性、设计和效率方面的下一波进展奠定了基础。

英文摘要

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's predictions or even a slight improvement; this holds across multiple tasks and datasets and enables more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2507.15294 2026-06-10 cs.SD cs.MM 版本更新

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

MeMo: 视觉受损条件下的实时视听目标说话人提取的注意力动量

Junjie Li, Wenxuan Wu, Shuai Wang, Zexu Pan, Kong Aik Lee, Helen Meng, Haizhou Li

发表机构 * Department of Electrical and Electronic Engineering, Faculty of Engineering, The Hong Kong Polytechnic University(电子工程系,工程学院,香港理工大学) Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong(系统工程与工程管理系,香港中文大学) School of Artificial Intelligence (SAI), The Chinese University of Hong Kong, Shenzhen(人工智能学院(SAI),香港中文大学深圳校区) School of Intelligence Science and Technology, Nanjing University(智能科学与技术学院,南京大学) Tongyi Lab, Alibaba Group, Singapore(通义实验室,阿里巴巴集团,新加坡)

AI总结 提出MeMo框架,通过两个自适应记忆库存储注意力信息,在视觉线索缺失时维持注意力动量,实现实时目标说话人提取,SI-SNR提升至少2dB。

详情
AI中文摘要

视听目标说话人提取(AV-TSE)旨在通过利用视觉线索作为指导,从多说话人环境中分离出目标说话人的声音。然而,AV-TSE系统的性能严重依赖于这些视觉线索的质量。在视觉线索缺失或严重退化的极端场景中,系统可能无法准确提取目标说话人。相比之下,人类即使在缺乏明确辅助信息的情况下也能保持对目标说话人的注意力。受这种人类认知能力的启发,我们提出了一种名为MeMo的新框架,该框架包含两个自适应记忆库来存储注意力相关信息。MeMo专为实时场景设计:一旦建立初始注意力,系统就会随时间维持注意力动量,即使视觉线索变得不可用。我们进行了全面的实验来验证MeMo的有效性。实验结果表明,我们提出的框架相比相应基线实现了至少2 dB的SI-SNR提升。

英文摘要

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. We conduct comprehensive experiments to verify the effectiveness of MeMo. Experimental results demonstrate that our proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.

9. 低资源、多语言与方言语音 3 篇

2606.10213 2026-06-10 cs.SD cs.AI 新提交

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

基于说话人日志和自监督学习的韩语幼儿语音自动发音评估

Diane Myung-kyung Woodbridge, Jee Hyun Suh

AI总结 提出结合神经说话人日志与自监督学习的端到端韩语幼儿发音评估流水线,引入53名2-5岁儿童录音语料库,通过多模型集成实现辅音和元音分类平衡准确率0.782。

Comments This paper will be presented at IEEE ICTs4ehealth in June, 2026

详情
AI中文摘要

言语障碍约占韩国儿科沟通障碍病例的44%,然而针对韩语幼儿语音的自动评估工具仍不成熟。本文提出一种端到端的韩语幼儿语音自动发音评估流水线,结合神经说话人日志与自监督语音表示学习。我们引入了一个经IRB批准的新语料库,包含53名2-5岁韩语儿童的录音。其中53名受试者的子集由三位独立评审员标注,得到1,190个辅音和748个元音的词汇级二元正确性标签。我们评估了三种说话人日志模型,发现NeMo SortFormer凭借其到达时间排序的Transformer架构,实现了88.69%的说话人计数准确率和33.04%的日志错误率(DER),该架构处理了表现出aegyo的年轻女性看护者与幼儿语音之间的声学混淆。对于发音评分,我们比较了三种自监督学习(SSL)骨干网络在多种池化策略下的表现。跨模型集成将辅音预测路由到HuBERT-large,元音预测路由到WavLM-large,实现了0.720和0.845的平衡准确率,平均值为0.782。

英文摘要

Speech sound disorders affect approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech, combining neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2-5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models, finding that NeMo SortFormer achieves 88.69% speaker count accuracy and 33.04% diarization error rate (DER) owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between young female caregivers exhibiting aegyo and toddler speech. For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble routing consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 and 0.845, with a mean of 0.782.

2606.10278 2026-06-10 cs.SD cs.AI 新提交

Towards Robust Arabic Speech Emotion Recognition with Deep Learning

基于深度学习的鲁棒阿拉伯语音情感识别

Youcef Soufiane Gheffari, Samiya Silarbi

发表机构 * ADASCA Laboratory – Advanced Data Science and Cognitive Applications, Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria(ADASCA实验室——高级数据科学与认知应用,奥兰穆罕默德·布迪夫科技大学(USTO-MB),阿尔及利亚奥兰)

AI总结 针对阿拉伯语音情感识别中方言多样、数据稀缺等问题,提出CNN-Transformer混合架构,在EYASE和BAVED数据集上达到98.1%准确率。

Comments 21 pages, 16 figures, 11 tables. Submitted manuscript

详情
AI中文摘要

语音情感识别(SER)旨在从音频信号中识别说话者的情感状态。尽管深度学习的最新进展显著提高了印欧语系语言的SER性能,但由于方言多样性、标注数据集有限以及难以同时建模局部频谱线索和长程时间依赖性,阿拉伯语SER仍然探索不足且具有挑战性。为解决这些限制,本研究探讨了联合建模空间和上下文信息的混合架构是否能改善阿拉伯语音的情感识别。我们提出并评估了一个包含三种架构的比较框架:CNN-LSTM模型、CNN-Transformer模型和微调的wav2vec 2.0模型。前两种模型利用MFCC和基于频谱图的表示,而wav2vec 2.0通过自监督表示直接对原始音频进行操作。在EYASE和BAVED数据集上进行的实验表明,所提出的CNN-Transformer架构显著优于其他模型,达到了98.1%的准确率。这一结果凸显了将卷积特征提取与基于Transformer的全局上下文建模相结合的有效性。本工作的主要贡献在于为阿拉伯语SER提供了混合方法和自监督方法的系统比较,并证明了CNN-Transformer架构在低资源和方言多样性环境中为捕捉频谱和长程依赖性提供了鲁棒解决方案。

英文摘要

Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.

2606.09553 2026-06-10 cs.CL cs.SD 交叉投稿

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

OpenBibleTTS:面向低资源语言的大规模语音资源与TTS模型

David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(米拉-魁北克人工智能研究所) AIMS Research and Innovation Centre(AIMS研究与创新中心) NM-AIST Saarland University(萨尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席)

AI总结 针对低资源语言TTS研究不足的问题,提出包含37种语言的OpenBibleTTS基准,系统比较多种TTS架构,发现无单一系统通用,并开源数据集与模型。

详情
AI中文摘要

神经文本转语音(TTS)和多语言语音生成的最新进展显著提升了合成语音质量,但这些进步在全球语言中分布不均。现有模型仍由少数高资源语言主导,而许多低资源TTS研究是在人工降采样的高资源语料库上模拟的,未能反映真正低资源环境中的正字法变化和有限的音系覆盖。为此,我们引入OpenBibleTTS,这是一个涵盖37种低资源语言的大规模低资源语音合成基准。此外,我们对各种TTS架构和大规模语音生成模型在领域内圣经文本和领域外材料上进行了系统比较。结果表明,没有单一系统在所有语言和指标上占优:Gemini-TTS在大多数评估语言上获得最高听众评分,但在OpenBibleTTS上训练的单一语言EveryVoice模型在可懂度上仍然最强,并在几种非洲语言中更受青睐,而从头训练的开放系统在领域外文本上性能急剧下降,揭示了广泛多语言覆盖与可靠合成质量之间在服务不足的语言社区中持续存在的差距。我们用主观人类判断补充自动评估,并开源所有处理后的数据集、对齐和训练模型,以支持未来的低资源TTS研究。

英文摘要

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

10. 数据集、基准与评测 4 篇

2606.09925 2026-06-10 cs.SD 新提交

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

AudioProcessBench: 音频基础推理中过程错误识别的基准

Xiangyu Zhao, Junyu Yan, Yaling Shen, Zimu Wang, Yiwen Jiang, Stephanie Fong, Qingyang Xu, Jiahe Liu, Dominic Dwyer, Zongyuan Ge

AI总结 提出AudioProcessBench基准,用于评估音频-语言模型在推理步骤中的过程错误识别能力,涵盖步骤正确性、错误类型检测和链级聚合三种范式。

详情
AI中文摘要

大型音频-语言模型(LALMs)越来越多地使用显式推理轨迹进行复杂的音频理解,但对推理质量的评估仍未被充分探索。尽管过程级基准(用于过程奖励模型PRMs)在文本和多模态领域推进了推理评估,但音频推理的类似评估仍然有限。在本文中,我们提出了AudioProcessBench,一个用于音频推理中步骤级过程错误识别的综合基准。AudioProcessBench包含由6个音频和全模态语言模型生成的不同推理轨迹。每个轨迹被分割成离散的推理步骤,并标注了二元步骤正确性和细粒度错误类型。我们的基准在三种互补范式下评估模型:(1)步骤正确性识别,(2)错误类型条件检测,用于诊断音频特定验证器能力,以及(3)链级聚合,其中验证器为同一问题选择或聚合多个推理轨迹。这种设计使得系统分析当前模型是否能检测过程错误、它们的弱点是否因音频特定错误类型而异,以及过程验证是否能转化为改进的答案选择成为可能。AudioProcessBench为未来关于音频推理验证器、过程奖励模型和可靠的全模态推理研究提供了测试平台。

英文摘要

Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by 6 audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capacities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.

2606.10911 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

Ethical and Technical Limits of Deepfake Speech Datasets

深度伪造语音数据集的伦理与技术限制

Vojtěch Staněk, Eva Trnovská, Kamil Malinka, Anton Firc

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 通过审计39个深度伪造语音数据集,发现公平性评估因缺乏人口统计元数据而不可行,且数据集间真实语音源语料库重叠严重,影响跨数据集评估的可靠性。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

关于深度伪造语音检测器的鲁棒性和公平性的声明,其可信度仅与用于训练和评估这些系统的数据集相当。我们对深度伪造语音领域进行了数据集级别的审计。我们整理并分析了39个深度伪造语音数据集,检查了关键属性,包括可访问性、文档、人口统计和语言覆盖范围、数据集规模以及底层的真实语音来源。我们的审计揭示了两个重要的发现。首先,公平性评估在很大程度上不可行,因为大多数数据集缺乏人口统计元数据,只有少数包含性别或语言标签。这阻止了任何有意义的子组分析,并使得其他人口统计属性未被处理。其次,我们识别出不同数据集之间底层的真实语音源语料库存在大量重叠,这可能破坏跨数据集评估,并导致对泛化能力的夸大声称。

英文摘要

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD 交叉投稿

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

DeRA-MOS:通过解耦列表排序和模态对齐优化文本到音乐评估

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

发表机构 * E.SUN Financial Holding Co., Ltd.(玉山金融控股股份有限公司) United Link Co., Ltd.(联合链接有限公司) Institute of Information Science, Academia Sinica(中央研究院资讯科学研究所) Department of Computer Science and Information Engineering, National Taiwan Normal University(台湾师范大学计算机科学与信息工程系)

AI总结 提出DeRA-MOS解耦优化框架,通过批感知列表排序损失和分数锚定模态对齐损失,分别优化音乐印象和文本对齐的排名指标,在MusicEval上显著提升评估性能。

Comments Accepted to IEEE Signal Processing Letters (SPL)

详情
AI中文摘要

评估文本到音乐(TTM)系统仍然昂贵,因为音乐印象(MI)和文本对齐(TA)分数依赖于人类平均意见分数(MOS)。大多数自动MOS估计器采用逐点回归或分布分类训练。这些目标不直接优化基于排名的指标,并且为跨模态一致性提供较弱的几何约束。为了解决这些问题,我们提出了DeRA-MOS,一种用于TTM评估的解耦优化框架。对于MI,我们引入了一种批感知列表排序损失,该损失对每个小批量内的相对顺序进行建模,并更好地与基于Spearman秩相关系数(SRCC)的评估对齐。对于TA,我们引入了一种分数锚定的模态对齐损失,将人类分数映射到目标音频-文本相似度,并在融合前正则化潜在空间。通过有效缓解逐点训练不匹配和模态漂移,MusicEval上的实验表明,我们的解耦框架在MI和TA排名指标上均取得了显著改进,为大规模TTM评估建立了稳健的范式。

英文摘要

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
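
The abstract above contrasts point-wise regression with a listwise ranking objective that aligns better with SRCC. The exact DeRA-MOS loss is not given in the abstract, so the sketch below substitutes a well-known listwise objective (the ListNet top-one cross-entropy) purely to illustrate the idea, together with a tie-free Spearman correlation. All function names and the example scores are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(predicted, human_mos):
    """ListNet top-one loss: cross-entropy between the softmax of the
    human scores and the softmax of the predictions within one batch."""
    p_true = softmax(human_mos)
    p_pred = softmax(predicted)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


def spearman(a, b):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical mini-batch: human MOS and two sets of model predictions.
mos = [1.5, 3.0, 4.2, 2.4]
well_ordered = [0.1, 0.6, 0.9, 0.4]  # same ordering as the human scores
reversed_order = [0.9, 0.4, 0.1, 0.6]  # exactly the opposite ordering

print(listnet_loss(well_ordered, mos), spearman(well_ordered, mos))
print(listnet_loss(reversed_order, mos), spearman(reversed_order, mos))
```

The well-ordered predictions receive the lower loss even though their absolute values are far from the MOS scale, which is the property a rank-based evaluation cares about.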

2603.11482 2026-06-10 cs.SD cs.CL eess.AS 版本更新

AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style

AnimeScore: 基于偏好的数据集与框架用于评估动漫风格语音

Joonyong Park, Jerry Li

发表机构 * Spellbrush, USA(美国Spellbrush)

AI总结 针对动漫风格语音缺乏客观评估指标的问题,提出基于偏好排序的框架AnimeScore,通过187名评估者的15000对判断数据,利用声学分析和SSL排序模型实现高达90.8% AUC的自动评估。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

目前评估“动漫风格”语音依赖于昂贵的主观判断,尚无标准化的客观指标。一个关键挑战在于,与自然度不同,动漫相似度缺乏共享的绝对尺度,使得传统的平均意见得分(MOS)协议不可靠。为填补这一空白,我们提出AnimeScore,一个基于偏好的框架,通过成对排序自动评估动漫相似度。我们收集了来自187名评估者的15000对成对判断,并附有自由形式的描述;声学分析表明,感知的动漫相似度由受控的共振峰塑造、韵律连续性和刻意发音驱动,而非简单的启发式规则如高音调。我们证明,手工设计的声学特征达到69.3%的AUC上限,而基于SSL的排序模型达到90.8%的AUC,提供了一个实用的度量标准,也可作为生成式语音模型基于偏好优化的奖励信号。

英文摘要

Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.

11. 安全、隐私与深度伪造音频 7 篇

2606.10223 2026-06-10 cs.SD cs.AI cs.CV 新提交

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

双分支门控融合用于开放集音频深度伪造源追踪

Awais Khan, Kutub Uddin, Khalid Malik

AI总结 针对开放集音频深度伪造源追踪问题,提出双分支门控融合框架,结合XLSR-53和CORES描述符,通过输入条件门控自适应加权,实现域内高精度和域外鲁棒泛化。

详情
AI中文摘要

将合成语音归因于其原始系统仍然是一个开放挑战:闭集模型无法拒绝未见过的合成器并产生过度自信的预测。为了解决这个问题,我们提出了一个双分支门控融合框架,将XLSR-53与CORES配对,CORES是一个66维描述符,与之前仅使用线性滤波器组(LFB)的工作不同,它跨越倒谱、振荡、节奏、能量和频谱维度,以捕获互补的合成伪影。我们的分析表明,XLSR-53在域内(ID)保持判别性,而CORES在分布偏移(OOD)下稳定泛化,但由于SSL表示不平衡,它们的简单拼接失败。为了解决这个问题,一个输入条件门控在联合训练下自适应地加权每个分支,使用交叉熵、用于ID/OOD分离的能量边际损失和门控多样性项。在MLAAD基准上,我们的系统实现了97.6%的ID准确率、4.9%的EERc,并且相对于Interspeech 2025基线,FPR95相对降低了83.5%。

英文摘要

Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
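
The abstract above describes an input-conditioned gate that weights the two branches instead of concatenating them, without specifying the gate's form. A minimal sketch of one plausible form follows, assuming a single scalar sigmoid gate and assuming both branch embeddings have already been projected to the same dimension. The function names, weights, and feature values are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ssl_feat, handcrafted_feat, gate_weights, gate_bias):
    """Compute a scalar gate g in (0, 1) from the concatenated input,
    then return the convex combination g * ssl + (1 - g) * handcrafted."""
    joint = ssl_feat + handcrafted_feat  # list concatenation
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)
    fused = [g * s + (1.0 - g) * h for s, h in zip(ssl_feat, handcrafted_feat)]
    return fused, g


# Hypothetical 2-dimensional projected embeddings for the two branches.
ssl_embedding = [1.0, 3.0]
descriptor_embedding = [3.0, 1.0]

# With zero weights and zero bias the gate is 0.5, i.e. a plain average.
fused, gate = gated_fusion(ssl_embedding, descriptor_embedding, [0.0] * 4, 0.0)
print(fused, gate)
```

Because the gate depends on the input, a trained version can lean on one branch for some utterances and on the other branch for others, which a fixed concatenation cannot do.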

2606.10246 2026-06-10 cs.SD cs.AI cs.LG 新提交

Linguistically Augmented Audio Speech Data (LinguAS)

语言增强音频语音数据 (LinguAS)

Ashley R. Keaton, Zahra Khanjani, Christine Mallinson, Vandana P. Janeja

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校)

AI总结 提出LinguAS数据集,通过专家定义的语言特征(EDLFs)增强音频数据,显著提升深度伪造语音检测模型性能。

详情
AI中文摘要

恶意创建的伪造语音,包括深度伪造和欺骗音频,正以惊人速度扩散,检测模型竞相保持领先。然而,大多数检测模型仅基于帧级音频特征进行推理,未利用更大时间尺度上的有价值语言线索。为弥补这一空白,我们提出语言增强音频语音数据(LinguAS),这是一个包含真实和深度伪造音频样本的数据集,标注了五种策略性选择的、专家定义的语言特征(EDLFs),这些特征在英语口语中频繁出现且是自然人类语音的特征。LinguAS包含超过800个音频样本,每个样本都标注了EDLFs。数据集包含四种欺骗音频攻击类型的平衡数量以及相应数量的真实语音样本。我们还包含说话者性别和每个欺骗音频样本的生成器/来源元数据,为模型训练提供更细粒度信息。我们发现,使用EDLFs增强数据训练的模型性能显著超过ASVspoof 2021深度学习基线和HuBERT、XLSR等SSL模型。LinguAS增强的语言、性别和生成器元数据为音频深度伪造研究者提供了一个强调真实人类语言特征的数据集,以改进伪造语音的模型推理。数据和代码已公开。

英文摘要

Maliciously created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet most detection models are trained to make inferences from frame-level audio features alone, without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically chosen, Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which is annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs performed significantly better than the ASVspoof 2021 deep learning baselines and SSL models such as HuBERT and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.

2606.10791 2026-06-10 cs.SD 新提交

Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

ESDD2概述:环境感知语音与声音深度伪造检测挑战赛

Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

发表机构 * Duke Kunshan University(昆山杜克大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) The University of Melbourne(墨尔本大学) Johns Hopkins University(约翰霍普金斯大学) Fortemedia Singapore(Fortemedia新加坡) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 介绍ESDD2挑战赛,评估语音和环境声音独立或联合操纵的检测系统,最佳系统Macro-F1达0.8775,模块化分解、跨域自监督编码器、数据增强和选择性集成是关键。

Comments Accepted to 2026 ICME workshop

详情
AI中文摘要

与ICME 2026联合举办的环境感知语音与声音深度伪造检测挑战赛(ESDD2)评估了五个组件级别的音频欺骗检测系统,其中语音和环境声音可能被独立或联合操纵。挑战结束后,我们分析了最终排行榜,并总结了来自顶级提交的有效设计选择。该挑战吸引了来自16个国家的94个注册;在验证提交要求和元数据后,保留了13个团队进行最终分析。在测试集上,最佳系统实现了0.8775的Macro-F1分数,显著优于分离增强的联合学习基线(0.6327)。顶级系统一致受益于模块化任务分解、跨域自监督编码器、针对性数据增强和选择性集成,而非简单的模型缩放。同时,辅助EER分析揭示了在检测伪造环境组件以及泛化到测试集中未见生成器方面的持续困难。本文报告了挑战结果,并为未来环境感知深度伪造检测研究提供了见解。CompSpoofV2数据集和基线代码仍公开可用,以促进可重复性。

英文摘要

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. Following the conclusion of the challenge, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports the challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.

2606.10908 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

RAT: Reference-Augmented Training for ASV Anti-Spoofing

RAT:面向ASV反欺骗的参考增强训练

Vojtěch Staněk, Anton Firc, Jakub Reš, Kamil Malinka

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出一种基于说话人参考录音的欺骗对抗架构,发现训练时引入参考通道可提升深度伪造检测性能,即使推理时参考缺失或失配。基于此提出参考增强训练(RAT)策略,在ASVspoof 5基准上以单个检测器达到2.57% EER和0.074 minDCF,超越大型集成系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们引入了一种以说话人参考录音为条件的欺骗对抗架构,但观察到它收敛到一种在推理时有效忽略参考的解决方案。令人惊讶的是,使用参考通道进行训练会诱导出不变性,从而改进深度伪造检测,即使在推理时参考缺失或失配。基于这一观察,我们提出了一种参考增强训练(RAT)策略。与单话语基线相比,RAT产生了改进的检测性能,即使在推理时将参考录音替换为零向量时也是如此。通过严格分析,我们证明优化过程迅速减少了参考贡献,导致推理很大程度上独立于参考通道。使用RAT,我们在ASVspoof 5基准上以单个检测器实现了最先进的2.57%等错误率和0.074最小检测代价函数,甚至超越了大型集成系统。

英文摘要

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.
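
The result above is reported as an equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. As a minimal sketch of how such a figure can be computed from detector scores, assuming higher scores mean "bona fide" and using a simple threshold sweep rather than the official ASVspoof evaluation tooling, consider the following. The function name and the score lists are hypothetical.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Sweep every observed score as a threshold and return the error
    rate at the point where false acceptance and false rejection are
    closest (their average at that threshold)."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # Bona fide trials scoring below the threshold are falsely rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # Spoof trials scoring at or above the threshold are falsely accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical detector scores (higher = more likely bona fide).
bonafide = [0.9, 0.8, 0.3, 0.7]
spoof = [0.1, 0.2, 0.6, 0.4]

eer = equal_error_rate(bonafide, spoof)
print(eer)
```

In this toy set one bona fide trial and one spoof trial fall on the wrong side of the best threshold, so a quarter of each class is misclassified.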

2606.10912 2026-06-10 cs.SD cs.AI cs.CR cs.LG 新提交

What Do Deepfake Speech Detectors Actually Hear?

深度伪造语音检测器实际上听到了什么?

Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

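Affiliations * Brno University of Technology

AI summary Proposes an explainability method based on Integrated Gradients over self-supervised representations and analyzes the decision cues of three WavLM-based detectors on ASVspoof 5, finding that they rely on non-speech/environment cues, localized phoneme artifacts, and word boundaries, respectively.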

Comments Accepted to Interspeech 2026

详情
AI中文摘要

深度伪造语音检测器通常输出一个分数,而不解释为什么音频样本被标记、证据在信号中的位置或哪些线索驱动了决策。我们提出了一种音频原生的可解释性管道,使用时间对齐的自监督表示上的积分梯度来随时间定位决策证据。我们将所提出的方法应用于ASVspoof5上的三个基于WavLM的检测器(AASIST、CA-MHFA、SLS),并手动注释最高归因区域以提供最重要线索的语义含义。尽管性能相似,检测器依赖不同的线索:AASIST强调非语音/环境线索,CA-MHFA关注局部音素伪影,SLS依赖词边界和频谱完整性。我们超越推测性推理,通过因果遮蔽主要检测器线索来验证我们的发现。观察到的性能下降进一步支持了解释的检测器语义。

英文摘要

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to assign semantic meaning to the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. The observed performance degradation further supports the explained detector semantics.
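Integrated Gradients attributes a model output to each input dimension by averaging gradients along a straight path from a baseline to the input and scaling by the input difference. The sketch below applies this to a toy logistic model with an analytical gradient; the model, the weights, and the all-zero baseline are assumptions for illustration only and do not reproduce the paper's pipeline over WavLM representations.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w):
    """Toy detector (hypothetical): logistic score over a feature vector."""
    return sigmoid(np.dot(w, x))


def model_grad(x, w):
    """Analytical gradient of the toy model with respect to its input."""
    s = model(x, w)
    return s * (1.0 - s) * w


def integrated_gradients(x, baseline, w, steps=200):
    """IG_i = (x_i - b_i) * mean over alpha of dF/dx_i at b + alpha * (x - b)."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule on [0, 1]
    avg_grad = np.zeros_like(x, dtype=float)
    for a in alphas:
        avg_grad += model_grad(baseline + a * (x - baseline), w)
    avg_grad /= steps
    return (x - baseline) * avg_grad


if __name__ == "__main__":
    w = np.array([1.0, -2.0, 0.5])
    x = np.array([0.5, 0.2, 1.0])
    baseline = np.zeros_like(x)
    attributions = integrated_gradients(x, baseline, w)
    # Completeness: attributions sum to F(x) - F(baseline), up to numerical error.
    print(attributions, attributions.sum(), model(x, w) - model(baseline, w))
```

The completeness check in the usage example is a convenient sanity test: if the attributions do not sum to the score difference, the number of integration steps is too small or the gradient is wrong.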

2606.06037 2026-06-10 cs.SD cs.CL eess.AS New submission

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

SpeechJBB:探究大型音频语言模型在代码切换语音下的安全对齐与理解

Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

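Affiliations * Mila - Quebec AI Institute; McGill University; Canada CIFAR AI Chair

AI summary Introduces the SpeechJBB dataset, which uses code-switched harmful audio and pseudo-word insertion to expose safety vulnerabilities of large audio language models in multilingual and spoken settings.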

详情
AI中文摘要

大型音频语言模型(LALMs)越来越多地部署在现实应用中,但其安全对齐仍主要在单语、基于文本的有害提示上进行评估。这导致其在多语言和口语设置,特别是代码切换语音下的泛化能力很大程度上未被探索。为填补这一空白,我们引入了SpeechJBB,一个用于对多种最先进LALMs进行基准测试的音频越狱数据集。通过引入一种增强设置,即在安全关键术语周围插入音位学上合理的伪词以模拟局部混淆,进一步探测了安全弱点的程度。跨模型而言,代码切换的有害音频产生了显著高的越狱成功率(JSR),其中非英语单语和非英语代码切换对表现出最高的攻击成功率。伪词插入进一步降低了拒绝率,表明听起来自然的混淆可以有效绕过安全策略。

英文摘要

Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

2512.07352 2026-06-10 cs.SD Version update

MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

MultiAPI Spoof:用于语音反欺骗检测的多API数据集和局部注意力网络

Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li

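Affiliations * Duke Kunshan University; The Chinese University of Hong Kong, Shenzhen; OfSpectrum, Inc.; Digital Innovation Research Center; School of Artificial Intelligence

AI summary To narrow the gap between existing anti-spoofing benchmarks and real-world scenarios, builds a multi-API dataset of about 230 hours of synthetic speech generated by 30 APIs and proposes Nes2Net-LA, a local-attention enhanced network that achieves state-of-the-art performance and strong robustness.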

Comments Accepted to Interspeech 2026

详情
AI中文摘要

现有的语音反欺骗基准依赖于一组狭窄的公共模型,与商业系统使用多样化、通常专有API的真实场景之间存在巨大差距。为解决这一问题,我们引入了MultiAPI Spoof,一个多API音频反欺骗数据集,包含由30种不同API(包括商业服务、开源模型和在线平台)生成的约230小时合成语音。此外,我们提出了Nes2Net-LA,一种局部注意力增强的Nes2Net变体,改进了局部上下文建模和细粒度欺骗特征提取。基于该数据集,我们还定义了API追踪任务,能够对欺骗音频进行细粒度的生成源归因。实验表明,Nes2Net-LA实现了最先进的性能,并在多样化和未见过的欺骗条件下表现出卓越的鲁棒性。代码和数据集已发布。

英文摘要

Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Furthermore, we propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Based on this dataset, we also define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code (https://github.com/XuepingZhang/MultiAPI-Spoof) and dataset (https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/) have been released.

12. Other / General Speech and Audio (4 papers)

2603.21050 2026-06-10 cs.SD Version update

ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition

ERM-MinMaxGAP:多语言多模态语音-LLM情感识别中的性别偏见基准测试与缓解

Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen

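Affiliations * Kyoto University, Japan; Agency for Science, Technology, and Research (A*STAR), Singapore

AI summary Addresses gender bias in emotion recognition with multilingual speech LLMs: proposes a multilingual, multimodal benchmark built on MELD-ST and designs the ERM-MinMaxGAP training objective, which uses an adaptive fairness weight and a MinMaxGAP regularizer to improve performance and narrow the gender gap across English, Japanese, and German.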

Comments This paper has been accepted for presentation at INTERSPEECH 2026

详情
AI中文摘要

语音情感识别(SER)系统可能表现出与性别相关的性能差异,但这种偏见如何在跨语言和跨模态的多语言语音大模型中体现尚不清楚。我们引入了一个基于MELD-ST的新型多语言多模态基准,涵盖英语、日语和德语,以量化特定语言的SER性能和性别差距。我们发现偏见强烈依赖于语言,并且多模态融合并不能可靠地提高公平性。为了解决这些问题,我们提出了ERM-MinMaxGAP,一种公平性感知的训练目标,它通过提出的自适应公平权重机制和一种新颖的MinMaxGAP正则化器(针对每种语言和模态内的最大男女损失差距)来增强经验风险最小化(ERM)。基于Qwen2-Audio骨干网络,我们的ERM-MinMaxGAP方法在单模态和多模态设置下分别将多语言SER性能提高了5.5%和5.0%,同时将整体性别偏见差距减少了0.1%和1.4%。

英文摘要

Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective, which augments empirical risk minimization (ERM) with a proposed adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
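The abstract describes the objective only at a high level: an ERM term plus a penalty on the largest male-female loss gap within each language and modality. The sketch below is one possible reading of that description, not the authors' formulation. The function name `erm_with_max_gap`, the absolute-difference gap, and the fixed weight `lam` (standing in for the adaptive fairness weight, whose form the abstract does not give) are all assumptions.

```python
import numpy as np


def erm_with_max_gap(losses, genders, groups, lam=1.0):
    """Mean loss plus lam times the largest male-female mean-loss gap over groups.

    losses:  per-sample losses
    genders: per-sample labels, "M" or "F"
    groups:  per-sample group key, e.g. a (language, modality) tag (hypothetical)
    """
    losses = np.asarray(losses, dtype=float)
    genders = np.asarray(genders)
    groups = np.asarray(groups)
    erm = losses.mean()  # standard empirical risk
    gaps = []
    for g in np.unique(groups):
        in_group = groups == g
        male = losses[in_group & (genders == "M")]
        female = losses[in_group & (genders == "F")]
        if len(male) and len(female):  # skip groups missing one gender
            gaps.append(abs(male.mean() - female.mean()))
    worst_gap = max(gaps) if gaps else 0.0
    return erm + lam * worst_gap, worst_gap


if __name__ == "__main__":
    total, gap = erm_with_max_gap(
        losses=[1.0, 2.0, 0.5, 0.5],
        genders=["M", "F", "M", "F"],
        groups=["en", "en", "ja", "ja"],
        lam=0.5,
    )
    print(total, gap)
```

Penalizing only the worst group gap focuses the regularizer on whichever language and modality is currently least fair, rather than on an average that could hide it.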

2507.19137 2026-06-10 eess.AS cs.AI cs.SD Version update

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

二元角色扮演场景中跨情境的人格维度评估

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

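Affiliations * Idiap Research Institute; The University of Texas at Austin

AI summary Analyzes conversational speech and finds that perceived personality differs significantly across work situations, and identifies acoustic features associated with each personality trait.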

详情
AI中文摘要

先前研究表明,用户偏好与其人格相匹配的辅助技术。这引发了对自动人格感知(APP)的兴趣,旨在预测个体感知到的人格特质。以往的APP研究将人格视为静态特质,独立于情境。然而,心理学研究表明,感知人格会随情境和场景而变化。在本研究中,我们调查了参与两种工作情境(中性面试和压力客户互动)的参与者对话语音与感知人格之间的关系。我们的主要发现是:1)感知人格在不同互动中显著不同;2)响度、声压级和频谱通量特征在中性互动中指示感知的外向性、宜人性、尽责性和开放性,而在压力情境中,神经质与这些特征相关;3)手工声学特征和非语言特征在感知人格推断中优于说话人嵌入;4)压力互动更能预测神经质,这与现有心理学研究一致。

英文摘要

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2509.22148 2026-06-10 eess.AS cs.SD Version update

Speaker Anonymisation for Speech-based Suicide Risk Detection

针对基于语音的自杀风险检测的说话人匿名化

Ziyun Cui, Sike Jia, Yang Lin, Yinan Duan, Diyang Qu, Runsen Chen, Chao Zhang, Chang Lei, Wen Wu

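Affiliations * University of Science and Technology of China

AI summary Studies speaker anonymisation for speech-based suicide risk detection, evaluating the trade-off between protecting speaker identity and preserving detection performance for methods based on traditional signal processing, neural voice conversion, and speech synthesis. Results show that combining anonymisation methods that retain complementary information preserves detection performance.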

Comments Accepted by ICASSP 2026

详情
AI中文摘要

青少年自杀是全球卫生问题,语音为自动自杀风险检测提供了成本效益高的模态。鉴于易受伤害的人群,保护说话人身份尤为重要,因为语音本身如果数据泄露或被恶意利用,可能泄露个人身份信息。本文首次系统研究了语音基于自杀风险检测的说话人匿名化。调查了广泛匿名化方法,包括基于传统信号处理、神经语音转换和语音合成的技术。构建了全面的评估框架,以评估保护说话人身份与保留对自杀风险检测至关重要的信息之间的权衡。结果表明,结合保留互补信息的匿名化方法可实现与原始语音相当的检测性能,同时保护易受伤害人群的说话人身份。

英文摘要

Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.

2212.04930 2026-06-10 eess.AS cs.HC cs.LG cs.SD Version update

DDSupport: Language Learning Support System that Displays Differences and Distances from Model Speech

DDSupport: 一种展示差异和距离的语言学习支持系统

Kazuki Kawamura, Jun Rekimoto

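Affiliations * The University of Tokyo, Tokyo, Japan; Sony CSL Kyoto, Kyoto, Japan

AI summary Proposes DDSupport, a system that computes pronunciation scores and detects mispronunciations from a small amount of unannotated speech data and visualizes the differences and distances between the learner's and model speakers' pronunciation in an intuitive way, helping non-native speakers improve their English speech intelligibility.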

详情
Journal ref
2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)
AI中文摘要

当初学者学习非母语发音时,他们难以自行判断发音是否良好。因此,计算机辅助发音训练系统被用来检测学习者的误发音。这些系统通常将用户发音与特定母语者的发音进行比较,以节奏、音素或单词为单位计算差异。然而,它们需要大量详细标注的语音数据或只能比较单一特定母语者。为克服这些问题,我们提出了一种新的语言学习支持系统,该系统基于少量未标注语音数据计算发音评分和检测初学者的误发音,而无需与特定个体比较。所提出的系统使用基于深度学习的语音处理技术,以直观的方式显示学习者发音的评分以及学习者与一组模型发音之间的差异/距离。学习者可以通过消除差异并缩短与模型的距离来逐步提高发音。此外,由于发音评分和差异/距离不是基于特定模型的特定句子计算的,用户可以自由选择他们想学习的句子。我们还构建了一个应用程序来帮助非母语者学习英语,并确认它可以提高用户的语音可懂度。

英文摘要

When beginners learn to speak a non-native language, it is difficult for them to judge for themselves whether they are speaking well. Therefore, computer-assisted pronunciation training systems are used to detect learner mispronunciations. These systems typically compare the user's speech with that of a specific native speaker as a model in units of rhythm, phonemes, or words and calculate the differences. However, they require extensive speech data with detailed annotations or can only compare with one specific native speaker. To overcome these problems, we propose a new language learning support system that calculates speech scores and detects mispronunciations by beginners based on a small amount of unannotated speech data without comparison to a specific person. The proposed system uses deep learning-based speech processing to display the pronunciation score of the learner's speech and the difference/distance between the learner's and a group of models' pronunciation in an intuitively visual manner. Learners can gradually improve their pronunciation by eliminating differences and shortening the distance from the model until they become sufficiently proficient. Furthermore, since the pronunciation score and difference/distance are not calculated compared to specific sentences of a particular model, users are free to study the sentences they wish to study. We also built an application to help non-native speakers learn English and confirmed that it can improve users' speech intelligibility.