arXivDaily arXiv Daily Academic Digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (2 papers)

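2606.16595 2026-06-16 cs.SD cs.AI new submission

ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition

Zeqian Hu, Fuliang Weng, Shu Shang, Yaqian Zhou

Affiliations Fudan University; Pedawise

AI summary Proposes ArtNet, a framework that combines a structured prediction task based on articulatory features with a variational information bottleneck to suppress language-specific variation, achieving a 20.56% relative reduction in phoneme error rate in zero-shot cross-lingual phoneme recognition.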

Comments Accepted at Interspeech 2026

Abstract

Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that explores a structured feature prediction task based on articulatory features to enhance acoustic robustness. Specifically, ArtNet integrates an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) to suppress language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when synergized with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving a 20.56% relative reduction in phoneme error rate (PER) and 7.01% in phoneme feature error rate (PFER).
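
The abstract does not say how the variational information bottleneck is computed. A common formulation, assumed here and not confirmed by the paper, models the bottleneck as a diagonal Gaussian posterior and penalizes its KL divergence from a standard normal prior. The function name `vib_kl` and its inputs are hypothetical.

```python
import math

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumed (common) VIB regularizer; the paper's exact form may differ.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# The penalty is zero when the posterior already equals the prior
# and grows as the posterior mean moves away from zero.
print(vib_kl([0.0, 0.0], [0.0, 0.0]))
print(vib_kl([1.0, 0.0], [0.0, 0.0]))
```

In a setup like this, the KL term would typically be added to the task loss with a small weight, so the latent keeps only what the prediction task needs.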

2605.03297 2026-06-16 cs.SD cs.LG version update

Contrastive Regularization for Accent-Robust ASR

Van-Phat Thai, Aradhya Dhruv, Duc-Thinh Pham, Sameer Alam

Affiliations Air Traffic Management Research Institute, Nanyang Technological University, Singapore; Center of AI Research, VinUniversity, Vietnam

AI summary Proposes supervised contrastive learning as a lightweight, accent-invariant auxiliary objective that regularizes encoder representations during CTC fine-tuning, with no architectural modification or explicit accent supervision, achieving up to 25-29% relative WER reduction on unseen accents on the L2-ARCTIC benchmark.

Comments Accepted by Interspeech 2026

Abstract

ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
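
The utterance-level contrastive loss can be illustrated with a minimal plain-Python sketch of the standard supervised contrastive loss. Treating utterances that share a transcript as positives is an assumption made here (it fits "no explicit accent supervision" but is not confirmed by the abstract). The temperature value and function names are also illustrative.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over utterance-level embeddings."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        peak = max(logits.values())  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings: labels that match the geometry give a lower loss.
emb = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(supcon_loss(emb, [0, 0, 1, 1]) < supcon_loss(emb, [0, 1, 0, 1]))
```

During fine-tuning, a term like this would be added to the CTC loss with a weighting coefficient. The abstract does not give that weighting.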

2. Speech Synthesis and Sound Generation (12 papers)

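2606.14922 2026-06-16 cs.SD cs.AI cs.CL eess.AS new submission

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

Vinh Dang Quang, Huy Ngo Quang

Affiliations Aimesoft JSC

AI summary Addresses the VLSP 2022 emotional speech synthesis task by integrating speaker embeddings and a prosody bottleneck into FastSpeech 2, targeting single-speaker emotional speech generation and cross-speaker style transfer.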

Comments 4 pages

Abstract

Over the last few years, speech synthesis has improved dramatically thanks to deep learning. A growing number of deep learning-based TTS systems can produce voices with high intelligibility and naturalness. Controlling expressiveness, however, remains a major challenge, and generating speech in different styles or manners has recently received considerable attention from the community. This paper presents our solutions to the emotional speech synthesis (ESS) task at VLSP 2022, which requires generating human-like, natural-sounding speech from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems show promise in generating emotional speech for a single speaker (Sub-task 1) and in transferring speaking styles from another speaker to a target speaker who has only neutral, non-expressive data, while retaining the target speaker's identity (Sub-task 2).

2606.15149 2026-06-16 cs.SD eess.AS new submission

AUDEDIT: Inversion-Free Text-Guided Editing with Pretrained Audio Flow Models

Zhongyuan Fu

Affiliations College of Computer Science, Nankai University

AI summary Proposes AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator, using a direct source-to-target ordinary differential equation to improve text alignment and audio fidelity.

Abstract

We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific direct source-to-target ordinary differential equation for one-dimensional Stable Audio 3 latents: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
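
The per-step update described above can be sketched as follows. This is a toy illustration modeled on the general inversion-free flow-editing recipe, not the author's implementation. It assumes t = 1 is pure noise and t = 0 is data, uses plain Euler steps, and replaces the audio model with a hypothetical closed-form velocity field (`toy_velocity`) for a "distribution" that is a single point.

```python
import random

def toy_velocity(x, t, target):
    # Hypothetical stand-in for the pretrained model: the exact rectified-flow
    # velocity when the conditional data distribution is the single point `target`.
    return [(xi - ci) / t for xi, ci in zip(x, target)]

def inversion_free_edit(x_src, c_src, c_tar, velocity, steps=50, seed=0):
    rng = random.Random(seed)
    z_edit = list(x_src)                                  # start from the source itself
    times = [1.0 - i / steps for i in range(steps + 1)]   # 1.0 down to 0.0
    for t, t_next in zip(times[:-1], times[1:]):
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy source sample from the shared stochastic source marginal.
        z_src = [(1.0 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Matching noisy point on the edited (target) path.
        z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tar = velocity(z_tar, t, c_tar)
        v_src = velocity(z_src, t, c_src)
        # Euler step driven by the difference of the two velocity fields.
        z_edit = [ze + (t_next - t) * (vt - vs)
                  for ze, vt, vs in zip(z_edit, v_tar, v_src)]
    return z_edit

x_src = [1.0, -2.0]
edited = inversion_free_edit(x_src, c_src=x_src, c_tar=[3.0, 0.5], velocity=toy_velocity)
print([round(v, 3) for v in edited])
```

In this toy setting the sampled noise cancels in the velocity difference, so the edited point moves from the source to the target condition without ever inverting the source to noise. With a real model the cancellation is only approximate.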

2606.15186 2026-06-16 cs.SD cs.AI eess.AS new submission

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, Weibei Dou, Zehua Chen, Jun Zhu

Affiliations Tsinghua University; Alibaba Group; Monash University; Renmin University of China; Fudan University

AI summary Proposes FreeSonic, a training-free framework built on the Rectified Flow-based TangoFlux model that uses an optimized inversion-reverse process, joint text-audio attention maps, and scheduled attention decoupling to achieve precise and consistent audio editing while preserving the background.

Comments Accepted at Interspeech 2026

Abstract

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/

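2606.16417 2026-06-16 cs.SD eess.AS new submission

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Xintong Wang, Ye Wang

Affiliations University of Science and Technology of China

AI summary Proposes Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction, integrates accent and speaker representations through conditional layer normalization, and introduces the WhisAID accent identification model, improving accentedness while preserving speaker identity.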

Abstract

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
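
Conditional layer normalization is usually implemented by predicting the scale and shift of a layer norm from a conditioning vector. The sketch below assumes that generic form, with the condition standing in for a combined accent and speaker representation. The projection matrices `w_gamma` and `w_beta` and the scale-around-one parameterization are illustrative choices, not details taken from the paper.

```python
import math

def conditional_layer_norm(x, cond, w_gamma, w_beta, eps=1e-5):
    """Layer norm whose scale and shift are linear functions of `cond`."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    # One row of weights per hidden unit; gamma is parameterized around 1.
    gamma = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

hidden = [1.0, 2.0, 3.0]
cond = [0.5, -0.5]  # hypothetical accent + speaker embedding
zeros = [[0.0, 0.0] for _ in hidden]
plain = conditional_layer_norm(hidden, cond, zeros, zeros)
shifted = conditional_layer_norm(hidden, cond, zeros, [[1.0, 0.0] for _ in hidden])
print(plain)
print(shifted)
```

With zero projections the function reduces to an ordinary layer norm, so the condition only takes effect through the learned projections.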

2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD cross-listing

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

Affiliations SPRING Lab, Indian Institute of Technology, Madras, India; MBZUAI, UAE

AI summary Proposes Pixel-TTS, a framework that renders text as images and generates embeddings through a 2D convolutional layer, eliminating embedding matrix expansion, improving robustness to unseen characters and orthographic variations, and enabling zero-shot generalization.

Comments 5 pages, 4 figures, 4 tables

Abstract

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.

2606.15267 2026-06-16 eess.AS cs.SD cross-listing

Dynamic Prosody Prediction in LLM-based TTS for Improving Speaker Similarity

Zhenwei Mou, Liping Chen, Yajun Hu, Zhen-Hua Ling, Xin Fang, Jianqing Gao

Affiliations University of Science and Technology of China; iFLYTEK

AI summary LLM-based TTS ignores style-specific prosodic patterns, which limits speaker similarity. This work proposes dynamic prosody prediction, which predicts the prosody of the current syllable from previously predicted speech, improving prosody learning and speaker similarity.

Comments Accepted to INTERSPEECH 2026. 5 pages, 2 figures. Audio samples: https://muzw.github.io/dynapros/

Abstract

Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To this end, we investigate the prosody learning conditioned on the synthesized speech, and propose to predict the prosody of the current syllable based on previously predicted speech. Experimental results obtained on three datasets demonstrated the efficacy of the proposed dynamic prosody prediction method in enhancing the prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at https://muzw.github.io/dynapros/.

2606.16435 2026-06-16 eess.AS cs.SD cross-listing

Unified Audio Generation and Editing via Joint Condition Modeling and Progressive Training

Haocheng Dong, Yuheng Lu, Cheng Gong, Shansong Liu, Xiao-Lei Zhang, Xuelong Li

Affiliations Department of Electronic Engineering and Information Science, University of Science and Technology of China; Tianjin Key Laboratory of Cognitive Computing and Application, School of Artificial Intelligence, Tianjin University; Institute of Artificial Intelligence of China Telecom (TeleAI)

AI summary Proposes AudioWeave, a unified model that handles text-to-audio generation and audio editing in a single framework through joint condition modeling and a progressive multistage training strategy, with performance competitive with task-specific models.

Abstract

With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges, adopting task-specific architectures or modules. This absence of a unified modeling paradigm substantially increases the overhead and complexity of building a system for both audio generation and editing, while also leading to limited scalability. To address this issue, we introduce AudioWeave, a unified model for TTA and audio editing without additional task-specific components. Specifically, we propose a joint condition modeling approach with a factorized position embedding, enabling the diffusion transformer backbone to operate under heterogeneous inputs of TTA and audio editing. We further propose a progressive multistage training strategy to mitigate task competition and catastrophic forgetting caused by interference among multiple tasks. This in turn helps maintain the performance of each individual task and may even lead to improvements in certain aspects. Experimental results on the TTA task and six audio editing tasks show that our unified model achieves competitive performance with task-specific models, laying the groundwork for further exploration of unified audio generation models.

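2606.16668 2026-06-16 eess.AS cs.SD cross-listing

CraBERT: Efficient Phoneme Encoder Pre-Training via Cascade Fusion of Subword Representations for Text-to-Speech

Dong Yang, Yuki Saito, Wataru Nakata, Hiroshi Saruwatari

AI summary Proposes CraBERT, a phoneme encoder that uses a cascade-fusion architecture and a subword-phoneme alignment algorithm to draw on a pre-trained subword-level BERT, reducing the pre-training the phoneme encoder needs: after about one epoch it reaches MOS values comparable to baselines pre-trained for about ten epochs.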

Abstract

This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.

2603.05373 2026-06-16 cs.SD eess.AS version update

MSpoofTTS: Multi-Resolution Spoof-Guided Inference for Discrete Speech Synthesis

Junchuan Zhao, Minh Duc Vu, Ye Wang

Affiliations School of Computing, National University of Singapore; Department of Statistics & Data Science, National University of Singapore

AI summary Proposes MSpoof-TTS, a framework that improves zero-shot speech synthesis with neural codec language models, without retraining, through multi-resolution spoof detection and a hierarchical decoding strategy.

Comments 7 pages, 3 figures, 3 tables, 2 algorithms. Accepted to Interspeech 2026

Abstract

Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples and code are available.
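
The prune-then-re-rank idea can be illustrated with a small sketch. Everything here is hypothetical: the stage layout, the `keep_fraction` parameter, and the toy detector `distinct_ratio`, which merely flags repeated tokens and stands in for the trained spoof detectors the abstract describes.

```python
def distinct_ratio(window):
    # Toy "naturalness" score: stuck or repeated tokens score low.
    return len(set(window)) / len(window)

def window_scores(tokens, window, score_fn):
    # Score non-overlapping windows of one temporal resolution.
    return [score_fn(tokens[i:i + window]) for i in range(0, len(tokens), window)]

def hierarchical_rerank(candidates, stages, keep_fraction=0.5):
    """Prune candidates stage by stage; each stage is (window_size, score_fn)."""
    survivors = list(candidates)
    for window, score_fn in stages:
        # Rank by the worst window, so one local artifact can sink a candidate.
        ranked = sorted(survivors,
                        key=lambda c: min(window_scores(c, window, score_fn)),
                        reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return survivors

candidates = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 1, 1, 1, 5, 6, 7, 8],
    [1, 2, 3, 4, 4, 4, 4, 4],
    [1, 2, 1, 2, 3, 4, 3, 4],
]
stages = [(4, distinct_ratio), (8, distinct_ratio)]
print(hierarchical_rerank(candidates, stages))
```

Because only the candidate set changes, a scheme like this leaves the generator's parameters untouched, which matches the training-free framing of the abstract.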

2606.07015 2026-06-16 cs.SD cs.AI version update

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

Affiliations Northwestern Polytechnical University; Kuaishou Technology; Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; University of Science and Technology of China; Shanghai Jiao Tong University

AI summary Proposes UniSinger, a framework built on a multimodal diffusion transformer that unifies zero-shot song generation and SVC with accompaniment co-generation, using a shared speaker embedding space and a curriculum learning strategy for cross-task timbre control and multi-task optimization.

Abstract

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and complementary benefits between them, offering new possibilities for intelligent music production.

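2606.09717 2026-06-16 cs.SD eess.AS version update

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

Zhu Li, Shekhar Nayak, Matt Coler

Affiliations University of Groningen

AI summary Manipulates speech rate, pitch variation, and loudness with a controllable neural TTS system and finds that loudness primarily drives human sarcasm perception whereas the model relies more on speech rate, revealing different prosodic cue weightings.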

Comments Accepted to Interspeech 2026

Abstract

Prosody plays an important role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
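
One simple way to obtain an orthogonal stimulus set is a full factorial crossing of the three manipulated cues, so that each cue varies independently of the others. The sketch below assumes that design. The level names and the choice of three levels per cue are hypothetical, since the abstract gives neither.

```python
from itertools import product

# Hypothetical level names; the paper's actual levels are not in the abstract.
LEVELS = {
    "speech_rate": ["slow", "normal", "fast"],
    "pitch_variation": ["narrow", "normal", "wide"],
    "loudness": ["soft", "normal", "loud"],
}

def factorial_design(levels):
    """Return every combination of cue levels as a list of condition dicts."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]

conditions = factorial_design(LEVELS)
print(len(conditions))  # 3 * 3 * 3 = 27 conditions
print(conditions[0])
```

In such a design every level of one cue co-occurs equally often with every level of the others, which is what lets the effect of each cue be estimated separately.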

2510.07096 2026-06-16 cs.CL cs.SD eess.AS version update

Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Affiliations Speech Technology Lab, University of Groningen, Campus Fryslân, The Netherlands; Center for Language and Cognition, University of Groningen, The Netherlands

AI summary Proposes a computational framework that models sarcasm by integrating semantic and prosodic cues, with semantic cues from a fine-tuned LLaMA 3 model and prosodic cues from a database of sarcastic speech. Speech synthesis tests show that combining the two enhances perceived sarcasm.

Comments Accepted to CogSci 2026

Abstract

Sarcasm is a pragmatic phenomenon in which speakers convey meanings that diverge from literal content, relying on an interaction between semantics and prosodic expression. However, how these cues jointly contribute to the recognition of sarcasm remains poorly understood. We propose a computational framework that models sarcasm as the integration of semantic interpretation and prosodic realization. Semantic cues are derived from an LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent, while prosodic cues are extracted through semantically aligned utterances drawn from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. Using a speech synthesis testbed, perceptual evaluations show that semantic and prosodic cues enhance perceived sarcasm, with the combined system achieving the best downstream F1 while maintaining high subjective sarcasm ratings. These findings highlight the complementary roles of semantics and prosody in pragmatic interpretation and illustrate how modeling can shed light on the mechanisms underlying sarcastic communication.

3. Speaker Recognition, Verification, and Separation (1 paper)

2606.16115 2026-06-16 eess.AS cs.SD cross-listing

Stabilizing Short Duration Speaker Verification through Neural Re-scoring with Hybrid Enrollment

Zhiqi Ai, Han Cheng, Shiyi Mu, Zhiyong Chen, Yongjin Zhou, Shugong Xu

Affiliations Shanghai University, China; Xi’an Jiaotong-Liverpool University, China; Hithink RoyalFlush AI Research Institute, China

AI summary To address the performance drop caused by insufficient speech duration in short-duration speaker verification, proposes a hybrid-enrollment neural re-scoring framework that combines text-dependent and text-independent enrollment and performs frame-level comparison via parallel cross-attention, yielding consistent improvements on the VoxPhrase corpus.

Comments Accepted by Interspeech 2026

Abstract

Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, thereby degrading performance. To investigate this issue, we construct VoxPhrase, a large-scale SDSV corpus automatically segmented from the VoxCeleb dataset. Our analysis shows that text-dependent (TD) enrollment is constrained by duration and yields unstable speaker representations. In contrast, although text-independent (TI) enrollment introduces content mismatch, its representations become more stable as the enrollment duration increases. Accordingly, we propose a hybrid-enrollment neural re-scoring framework that combines TD and TI enrollment and performs frame-level comparison via parallel cross-attention. Experiments on VoxPhrase demonstrate consistent improvements across multiple speaker models.

4. Speech Enhancement, Denoising, and Audio Restoration (2 papers)

2606.15540 2026-06-16 cs.SD cs.AI cs.MM eess.AS new submission

AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

Pengfei Zhang, Hoang H Nguyen, Yutong Song, Wenjun Huang, Tahmid Imtiaz Imu, Henry Peng Zou, Jiang Wu, Honghui Xu, Amir M. Rahmani

Affiliations University of California Irvine; University of Illinois Chicago; Kennesaw State University

AI summary For pathological speech from patients with neurodegenerative and neuromotor disorders, proposes AP-GRPO, a framework that optimizes speech language models with an anchor-gated reward and a phonetic alignment reward to achieve faithful reconstruction and reveal disease-specific patterns.

Abstract

Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.
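
The "group relative" part of GRPO can be shown in a few lines: several candidate reconstructions are sampled for the same recording, each receives a scalar reward, and each reward is standardized against its own group. The sketch assumes the rewards (for example, a combination of the anchor-gated and phonetic-alignment terms) have already been computed. Using the population standard deviation and this `eps` value are assumptions, and implementations vary.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three sampled reconstructions of one utterance.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
print(group_relative_advantages([0.5, 0.5]))  # identical rewards carry no signal
```

Candidates that beat their group average get a positive advantage and are reinforced, so no separate value model is needed to provide a baseline.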

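2606.16464 2026-06-16 eess.AS cs.SD cross-listing

Towards Robust Generative Speech Enhancement Using Vector Quantisation-Based Neural Audio Codec

Haixin Zhao, Nilesh Madhu

Affiliations IDLab, Ghent University - imec

AI summary Studies modelling strategies for continuous and discrete latent spaces in VQ-based NAC speech enhancement, proposes the cNAC-SE and dNAC-SE frameworks, and finds that VQ regularisation enhances robustness through a clean-prior constraint, with cNAC-SE achieving leading DNS-MOS performance among established generative approaches.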

Comments Accepted by Interspeech 2026

Abstract

This work investigates modelling strategies in continuous and discrete latent spaces for vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous representations and discrete tokens in latent space, respectively. Theoretical analysis and visualisations in latent space are performed to exhibit their inherent modelling mechanisms. Experimental results show that the fully fine-tuned cNAC-SE model consistently outperforms all dNAC-SE variants across diverse test conditions and achieves leading performance among established generative approaches in DNS-MOS metrics. Comparison with the discriminative counterpart shows that VQ enhances robustness through an intrinsic effect of clean-prior-constrained regularisation, independent of discrete token processing. This highlights the transferable value of VQ regularisation to other continuous modelling methods.
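
The distinction between continuous representations and discrete tokens rests on the basic vector-quantisation lookup, sketched below for a single latent vector. The codebook values are made up for illustration. Real codecs typically use learned codebooks, often several in a residual stack, which this sketch does not model.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the nearest codebook entry by squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sqdist(vector, codebook[k]))
    return index, codebook[index]

# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
token, codeword = quantize([0.9, 1.2], codebook)
print(token, codeword)
```

A model working on discrete tokens would predict `token`, while one working in the continuous space would predict the latent vector itself and let the quantiser snap it to the codebook afterwards.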

5. Audio Event Detection and Scene Understanding (1 paper)

2507.07879 2026-06-16 cs.SD eess.AS version update

LISTEN: Lightweight Industrial Sound-representable Transformer for Edge Notification

Changheon Han, Yun Seok Kang, Yuseop Sim, Hyung Wook Park, Martin Byung-Guk Jun

Affiliations School of Mechanical Engineering, Purdue University; Department of Mechanical Engineering, UNIST

AI summary Proposes LISTEN, a lightweight foundation model for industrial sound compressed from the large IMPACT model via knowledge distillation. Adapted with only a small amount of data, it enables real-time machine monitoring on edge devices with performance close to the large model.

详情
Journal ref
Advanced Engineering Informatics, Volume 76, Part A, 2026, 104944
AI中文摘要

基于深度学习的机器听觉正在拓宽工业声学分析的范围,但其在实时车间中的广泛实施受到对每个新任务依赖大型、任务特定标注数据集的阻碍。虽然新兴的通用声音基础模型旨在减轻数据依赖性,但它们在实践中暴露出关键困境。通用声音基础模型计算成本高,并且在以音调谐波、宽带噪声和瞬态故障事件为特征的工业场景中失败,使得即时、现场部署不切实际。这些挑战共同意味着,在实时车间部署声音基础模型的实用端到端系统仍然难以实现。为了解决这一挑战,本研究引入了LISTEN(面向边缘通知的轻量级工业声音可表示Transformer),这是第一个专门针对工业声音的轻量级基础模型。通过从大规模教师模型IMPACT(基于声学认知Transformer的工业机器感知)进行知识蒸馏,我们构建了针对资源受限边缘环境优化的LISTEN。通过冻结骨干网络并仅对最小目标过程数据训练浅层头部,而不是进行完全微调或重新训练,LISTEN在多种制造过程中实现了与IMPACT几乎相同的性能。本研究进一步展示了一个完整的实时机器监控系统,包括使用工业物联网(IIoT)设备进行数据采集、使用最小标注数据进行快速模型适应,以及在低成本边缘设备上进行实时监控。通过在实时CNC机器上验证整个系统,这项工作建立了在活跃工业环境中部署轻量级工业声音基础模型的第一个可行的端到端系统。

英文摘要

Deep learning-based machine listening is broadening the scope of industrial acoustic analysis, yet its widespread implementation on live shop floors is hindered by the reliance on large, task-specific annotated datasets for every new task. While emerging general-purpose sound foundation models aim to alleviate data dependency, they reveal critical dilemmas in practice. General-purpose sound foundation models are computationally expensive and fail in industrial scenarios characterized by tonal harmonics, broadband noise, and transient fault events, making instant, on-site deployment impractical. These challenges combined mean that a practical, end-to-end system for deploying a sound foundation model on a live shop floor has remained elusive. To address this challenge, this study introduces LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), the first lightweight foundation model specialized for industrial sound. Through Knowledge Distillation (KD) from the large-scale teacher model IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), we construct LISTEN optimized for resource-constrained edge environments. By freezing the backbone and training only a shallow head on minimal target-process data, rather than performing full fine-tuning or retraining, LISTEN achieves nearly identical performance to IMPACT across diverse manufacturing processes. This study further demonstrates a complete system for real-time machine monitoring, encompassing data acquisition with Industrial Internet of Things (IIoT) devices, rapid model adaptation using minimal annotated data, and real-time monitoring on a low-cost edge device. By validating the entire system on a live CNC machine, this work establishes the first feasible end-to-end system for deploying a lightweight industrial sound foundation model in an active industrial environment.

6. 音乐信息检索与音乐生成 6 篇

2606.16412 2026-06-16 cs.SD eess.AS math.HO math.NT 新提交

An Asymmetric Formula for Interval Consonance and its Relation to Harmonic Coincidence

音程协和度的不对称公式及其与谐波重合的关系

David De Roure

发表机构 * University of Oxford(牛津大学) Royal Northern College of Music(皇家北方音乐学院)

AI总结 提出一个不对称公式 f(p/q) = p + Ω(q) 用于度量音程协和度,并证明在标准协和数据上表现良好,同时揭示了与欧拉和谐波重合模型的联系。

Comments Working note to support OEIS submissions

详情
AI中文摘要

欧拉的 Gradus Suavitatis (1739) 通过公式 G(p/q) = 1 + Ω(p) + Ω(q) 为音程 p/q 分配一个不协和值,其中 Ω(n) = \sum_i e_i(p_i - 1) 对 n 的加权质数指数求和。我们提出更简单的不对称公式 f(p/q) = p + Ω(q),该公式对分子和分母区别对待,并在标准协和数据上表现相当。我们还表明,在谐波被整数索引并均匀计数到固定截断水平的模型下,Gradus 等价于加权谐波重合计数,权重为 w(n) = Ω(n),从而将其与伽利略早期的脉冲重合模型 (1638) 联系起来。该公式自然生成一个互质整数三角形 T(n,k) = n + Ω(k),其最右对角线给出了超特定(连续谐波)音程的两阶段不协和。公式 f 允许在谐波背景和部分识别方面进行简单的两阶段解释,我们将其作为一种推测性的感知假设提出。

英文摘要

Euler's Gradus Suavitatis (1739) assigns a dissonance value to a musical interval p/q by the formula G(p/q) = 1 + \hat{\Omega}(p) + \hat{\Omega}(q), where \hat{\Omega}(n) = \sum_i e_i(p_i - 1) sums the weighted prime exponents of n. We propose the simpler asymmetric formula f(p/q) = p + \hat{\Omega}(q), which treats numerator and denominator differently and performs comparably on standard consonance data. We also show that, under a model in which harmonics are integer-indexed and counted uniformly up to a fixed truncation level, Gradus is equivalent to a weighted harmonic coincidence count with weights w(n) = \hat{\Omega}(n), connecting it to Galileo's earlier pulse-coincidence model (1638). The formula naturally generates a coprime integer triangle T(n,k) = n + \hat{\Omega}(k), whose rightmost diagonal gives the two-stage dissonance of the superparticular (consecutive-harmonic) intervals. The formula f admits a simple two-stage interpretation in terms of harmonic context and partial recognition, which we offer as a speculative perceptual hypothesis.
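The two formulas in this abstract are simple enough to evaluate directly. The sketch below computes the weighted prime-exponent sum \sum_i e_i(p_i - 1), Euler's Gradus, and the asymmetric formula for an interval p/q. The function names are illustrative, and the requirement that p and q be coprime is an assumption made here (the abstract only mentions a coprime triangle).

```python
from math import gcd


def omega_hat(n: int) -> int:
    """Sum of e_i * (p_i - 1) over the prime factorisation n = prod p_i^e_i."""
    total, p = 0, 2
    while p * p <= n:
        while n % p == 0:
            total += p - 1
            n //= p
        p += 1
    if n > 1:  # one prime factor larger than sqrt(n) may remain
        total += n - 1
    return total


def gradus(p: int, q: int) -> int:
    """Euler's Gradus for the interval p/q, as stated in the abstract."""
    if gcd(p, q) != 1:
        raise ValueError("expected p/q in lowest terms (assumption of this sketch)")
    return 1 + omega_hat(p) + omega_hat(q)


def asymmetric(p: int, q: int) -> int:
    """Asymmetric formula from the abstract: numerator plus weighted sum of the denominator."""
    if gcd(p, q) != 1:
        raise ValueError("expected p/q in lowest terms (assumption of this sketch)")
    return p + omega_hat(q)


# Compare both measures on a few familiar intervals.
for p, q in [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5)]:
    print(f"{p}/{q}: gradus={gradus(p, q)}, asymmetric={asymmetric(p, q)}")
```

For the perfect fifth 3/2 both measures give 4; for the fourth 4/3 they differ (5 versus 6), which shows how the asymmetric formula treats numerator and denominator differently.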

2606.16612 2026-06-16 cs.SD cs.LG cs.MM 新提交

Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

超越伪影:基于音乐内在特征的可泛化合成歌曲检测

Yan Han, Zhibin Wen, Yuan Wang, Shuangrun Shao, Xiaobing Li, Yang Xu, Wei Li

发表机构 * Central Conservatory of Music(中央音乐学院) Southern University of Science and Technology(南方科技大学) Fudan University(复旦大学)

AI总结 提出Sofia框架,通过特征特定专家和自适应混合专家模型利用音乐内在特征(人声、音频效果、全局结构)进行合成歌曲检测,在MUSIC8K基准上F1提升18.5点,具有强鲁棒性。

详情
AI中文摘要

AI音乐生成器的快速发展凸显了对可靠合成歌曲检测(SSD)的迫切需求。现有SSD方法通常依赖于低级伪影或固定特征假设,难以捕捉生成器无关的线索。为解决这一问题,我们提出Sofia(基于音乐特征的合成歌曲检测框架),一个灵活的框架,通过特征特定专家和自适应混合专家(MoE)模块对音乐内在属性进行建模。通过使用代表性的人声、音频效果、全局结构特征及其组合配置Sofia,我们展示了它们的个体和互补贡献。为全面评估我们的框架,我们进一步构建了MUSIC8K,一个具有挑战性的基准,包含最新出现的生成器和逼真的音频扰动。实验表明,Sofia从音乐内在特征中学习生成器无关的表示,在MUSIC8K-O上相比最强基线F1分数提升18.5点,同时保持强鲁棒性。

英文摘要

The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, Global structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.

2606.17006 2026-06-16 cs.SD cs.AI cs.LG cs.MM eess.AS 新提交

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

TuneJury: 一种改进音乐生成偏好对齐的开放指标

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Sony AI(索尼AI) Georgia Tech(佐治亚理工学院) KAIST(韩国科学技术院) Peking University(北京大学) QMUL(伦敦玛丽女王大学)

AI总结 提出TuneJury,一个开放、实例级别的成对奖励模型,用于文本到音乐生成,通过预测偏好分数支持数据筛选、后处理校准,并在推理、优化和训练中提升对齐效果。

Comments 32 pages, 9 figures

详情
AI中文摘要

我们引入了TuneJury,一个开放、实例级别的成对奖励模型,用于文本到音乐生成,它从文本提示和音频片段中预测音乐偏好分数。发布的检查点在公开的人类偏好标签上训练,涵盖竞技场风格(A vs. B)投票、度量对齐偏好对、众包成对比较和专家审美评分。两个片段之间的预测分数差在我们的保留测试集上校准良好,支持通过简单的分数阈值进行数据筛选。TuneJury泛化到保留测试对和分布外基准,在后一任务上与先前基线保持竞争力。对于训练后发布的生成器,我们引入了锚定校准,一种事后、每系统的Bradley-Terry校准,以显著优于从头再训练的数据效率恢复一致性。相同的冻结奖励在三个下游应用中驱动一致的奖励轴增益:推理时的最佳N选择、DITTO风格的潜在优化和专家迭代后训练。TuneJury可在https://github.com/yonghyunk1m/TuneJury获取。

英文摘要

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
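One common way to read a score margin between two clips is through the Bradley-Terry model, where the preference probability is a logistic function of the score difference. The sketch below shows that reading together with best-of-N selection and margin-threshold filtering. All names, scores, and the threshold are hypothetical; this is not the TuneJury API, and the abstract does not state that its margins are mapped to probabilities in exactly this way.

```python
import math
from typing import Dict, List, Tuple


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def best_of_n(scores: Dict[str, float]) -> str:
    """Inference-time best-of-N: keep the candidate with the highest reward score."""
    return max(scores, key=scores.get)


def filter_pairs(pairs: List[Tuple[float, float]], min_margin: float) -> List[Tuple[float, float]]:
    """Keep only pairs whose absolute score margin reaches the threshold."""
    return [(a, b) for a, b in pairs if abs(a - b) >= min_margin]


# Hypothetical reward scores for three generated clips.
candidate_scores = {"clip_a": 0.1, "clip_b": 0.7, "clip_c": 0.3}
chosen = best_of_n(candidate_scores)
```

A larger margin threshold keeps fewer but more clear-cut pairs, which is the trade-off behind filtering data with a simple score threshold.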

2606.15813 2026-06-16 eess.AS cs.SD 交叉投稿

AdaTT: Text-Guided Instrument Timbre Transfer with Target-Adaptive Structural Control

AdaTT: 文本引导的乐器音色转换与目标自适应结构控制

Dabin Kim, Junwon Lee, Juhan Nam

发表机构 * Graduate School of Cultural Technology(文化科技研究生院) Graduate School of AI(人工智能研究生院)

AI总结 提出AdaTT系统,通过文本提示自适应调整音高和响度控制的影响,解决细粒度结构条件下乐器音色转换中的音色模糊问题,实现高音色保真度。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

本文解决了细粒度结构条件下乐器音色转换中的音色模糊问题。我们认为这一问题源于这些条件下乐器特有的表达细节与目标音色属性之间的冲突。例如,将小提琴的音高主导颤音轮廓强加于自然表现出响度主导颤音的长笛上,会损害音色保真度。我们提出了AdaTT,一个在ControlNet方案内确保跨不同音色转换场景高音色保真度的目标自适应系统。它通过文本提示选择性地缩放帧级音高和响度控制的影响,以匹配目标乐器的身份。我们还提出了一个半自动数据构建流程,以教导模型哪些表达细节需要转换或保留。结果表明,AdaTT在保持乐谱级内容的同时,实现了卓越的音色保真度和自然度。音频样本可在https://dabinkim0.github.io/adatt/获取。

英文摘要

This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system that ensures high timbral fidelity across diverse timbre transfer scenarios within the ControlNet scheme. It selectively scales the frame-wise influence of pitch and loudness controls via text prompts to match the target instrument's identity. We also present a semi-automatic data construction pipeline to teach the model which expressive details to transform or preserve. Results show AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content. Audio samples are available at https://dabinkim0.github.io/adatt/.

2605.04998 2026-06-16 cs.SD cs.IR cs.LG 版本更新

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

流行与爵士混合比例对体裁自适应和弦生成的实证研究

Jinju Lee

发表机构 * PearlLeeStudio(pearllee studio)

AI总结 本研究通过调整流行与爵士数据的混合比例进行和弦生成的回放(rehearsal)实验,发现适度的流行回放能在保持流行准确率的同时提升爵士预测性能,并修正了先前版本中的检查点选择错误。

Comments Erratum: the released F1 checkpoint equals the Phase-0 pop baseline (full SHA-256 verified); min mixed validation loss selection kept the unadapted warmup epoch. Tables 4 and 5 are best epoch metrics; mix ratio conclusions hold. A corrected retrain (jazz only validation), ft-pop80-v2, reproduces across 3 seeds. v1 F2 row fixed. 3 figs, 5 tables. https://huggingface.co/PearlLeeStudio

详情
AI中文摘要

本修订更新了一项关于流行到爵士和弦生成的回放(rehearsal)研究。最佳轮次(best-epoch)的指标仍然表明,适度的流行回放能在保持流行准确率的同时提升爵士预测性能,但v2修正了已发布检查点的选择:已发布的F1等同于阶段0(Phase 0),F2存在转录错误,而ft-pop80-v2在3个种子上恢复了哈希值不同的、经爵士适应的F1。

英文摘要

This revision updates a pop-to-jazz chord-generation rehearsal study. Best-epoch metrics still show that modest pop rehearsal preserves pop accuracy while improving jazz prediction, but v2 corrects released-checkpoint selection: the released F1 equals Phase 0, F2 had a transcription error, and ft-pop80-v2 restores a hash-distinct jazz-adapted F1 across 3 seeds.

2606.07334 2026-06-16 cs.SD cs.LG 版本更新

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

和弦符号时间序列适应能承载多远流派身份?多流派和弦符号建模的能力与边界

Jinju Lee

发表机构 * PearlLeeStudio

AI总结 本研究评估了五种轻量级适应方法(LoRA、IA3、BitFit、前缀微调和全微调)将预训练流行爵士和弦模型扩展到11个目标流派的效果,发现所有方法均能提升和弦预测性能,但和弦符号本身不足以完整传递流派身份。

Comments v3: ft-pop80-v2, a selection-corrected, hash-distinct jazz base, exists, reproducing over 3 seeds (top-1 75.76 +/- 0.03), so the Sec. 8 base robustness ablation is now gated by effort, not checkpoint availability. Added a v3 changelog; corrected Sec. 5.2/6.3/6.9 stats for CSV fidelity (no qualitative changes). https://github.com/PearlLeeStudio/TheArtist | https://huggingface.co/PearlLeeStudio

详情
AI中文摘要

和声是一个紧凑的符号层,其中数学音高关系、声学协和与音乐惯例交汇。本报告不将和弦符号序列视为音乐的完整表示,而是将其作为可解释、可控的时间序列,用于流派局部和声建模。从一个冻结的流行爵士音乐变换器检查点开始,我评估了小型适应接口能将模型扩展到11个目标流派的程度:布鲁斯、波萨诺瓦、巴赫众赞歌、乡村、电子、民谣、放克、福音、嘻哈、R&B/灵魂乐和摇滚。主要比较了LoRA、IA3、BitFit、前缀微调和全微调在11个流派和3个种子上的表现,构成完整的165个单元格网格。所有五种方法在留出集和弦预测上都优于冻结基线,宏平均增益从+2.89到+3.61分;LoRA和IA3得分最高,但经Holm和Benjamini-Hochberg校正的Wilcoxon检验不支持决定性优胜者。一个匹配数据量的对照实验进一步明确了这一点:当流派被子采样到共同语料库大小时,IA3保持领先,但LoRA的全数据优势消失并跌至最后,表明小差距部分由数据驱动。一个控制标记基线也很强,错误流派适配器通常优于冻结基线,表明大部分效果来自对可重用和声基底的轻量级条件化,而非特定适配器家族。额外的诊断(秩扫描、错误流派轮换、基础检查点消融、仅和弦流派分类、生成输出统计、真实歌曲评估和重复分析)支持一个有限的结论:和弦符号适应可靠地改进了流派局部和声预测,但仅靠和弦符号不能承载完整的流派身份。因此,本报告避免关于感知流派真实性或完整音乐质量的声明,这需要受控的听众或音乐家评估。

英文摘要

This revision updates an 11-genre chord-symbol adaptation report. The main 165-cell result is unchanged: all methods improve over the frozen pure-pop base, with no decisive method winner. v3 adds the ft-pop80-v2 multi-seed base-restoration note and corrects a few summary statistics for exact CSV faithfulness without changing conclusions.

7. 语音翻译与语音语言模型 2 篇

2506.16738 2026-06-16 cs.CL cs.AI cs.SD eess.AS 版本更新

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

LM-SPT:面向语音标记化的LM对齐语义蒸馏

Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(高丽大学人工智能系) Multi-modal Model Training, Kakao Corp(Kakao公司多模态模型训练部)

AI总结 提出LM-SPT方法,通过语义语音重合成蒸馏,在不降低帧率的情况下生成与语言模型更对齐的离散语音标记,在ASR和TTS任务中优于现有方法。

详情
AI中文摘要

随着语音语言模型(SLM)的快速发展,离散语音标记已成为语音和文本之间的核心接口,实现了跨模态的统一建模。最近的语音标记化方法旨在从低级声学中分离语义信息,以更好地与语言模型(LM)对齐。特别是,以前的方法使用自监督学习(SSL)教师模型(如HuBERT)提取语义表示,然后将其蒸馏到语义量化器中,以抑制声学冗余并捕获与内容相关的潜在结构。然而,这些标记器通常以相对较高的帧率运行,产生的标记序列明显长于其文本对应物,阻碍了与预训练LM的无缝集成。尽管最近的方法尝试通过对SSL特征应用均匀平均池化来降低标记率,但这可能会过度平滑包含内容的区域并稀释结构信息,从而可能限制LM对齐。为了解决这个问题,我们提出了LM-SPT,一种基于语义语音重合成蒸馏的LM对齐语音标记化方法。LM-SPT不是通过池化直接匹配教师和学生特征,而是仅从语义标记重合成语音,并使用冻结的、LM对齐的语音编码器最小化从原始波形和重合成波形提取的表示之间的差异。这种间接监督避免了严格的时间对齐,并鼓励在降低帧率下与LM更语义对齐的专用语义单元。实验结果表明,在自动语音识别和文本到语音任务中,即使在不损害编解码器级别的语音重建保真度的情况下,所提出的LM-SPT在应用于SLM时也始终优于先前的语义增强语音标记器。

英文摘要

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder. This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that are more semantically aligned with LMs under reduced frame rates. Experimental results show that the proposed LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level.
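The over-smoothing concern raised in this abstract can be seen in a toy example of uniform average pooling, the token-rate reduction the abstract contrasts LM-SPT against. The sketch below averages consecutive groups of frames; the function name and the one-dimensional feature values are illustrative assumptions, not part of LM-SPT.

```python
from typing import List


def average_pool(frames: List[List[float]], factor: int) -> List[List[float]]:
    """Average consecutive groups of `factor` frames; a shorter trailing group is averaged as-is."""
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(frame[d] for frame in group) / len(group) for d in range(dim)])
    return pooled


# A sharp, content-bearing event in frame 3 of a toy 1-D feature sequence.
features = [[0.0], [0.0], [1.0], [0.0]]
halved_rate = average_pool(features, 2)
```

Halving the frame rate turns the peak of 1.0 into 0.5, since it is averaged with its silent neighbour. This is the kind of dilution that uniform pooling can introduce.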

2603.05299 2026-06-16 cs.LG cs.AI cs.CL cs.SD 版本更新

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

WavSLM: 通过WavLM蒸馏的单流语音语言建模

Luca Della Libera, Cem Subakan, Mirco Ravanelli

发表机构 * Concordia University(康科迪亚大学) Mila-Quebec AI Institute(Mila魁北克人工智能研究所) Université Laval(拉瓦尔大学)

AI总结 提出WavSLM,通过量化蒸馏WavLM自监督表示到单一码本并优化自回归下一块预测,实现无文本监督的单流语音语言建模,在一致性和生成任务上表现竞争。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型语言模型表明,简单的自回归训练可以产生可扩展且连贯的生成,但由于语义和声学信息的纠缠,将这一范式扩展到语音仍然具有挑战性。大多数现有的语音语言模型依赖于文本监督、分层令牌流或复杂的混合架构,偏离了在文本中已被证明有效的单流生成预训练范式。在这项工作中,我们引入了WavSLM,一种通过将自监督WavLM表示量化和蒸馏到单一码本中,并优化自回归下一块预测目标来训练的语音语言模型。WavSLM在单个令牌流中联合建模语义和声学信息,无需文本监督或文本预训练。尽管其简单性,它在一致性基准和语音生成方面取得了有竞争力的性能,同时使用更少的参数、更少的训练数据,并支持流式推理。

英文摘要

Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data and supporting streaming inference.

8. 多模态音频与视听学习 3 篇

2606.14788 2026-06-16 cs.SD cs.AI cs.LG eess.AS 新提交

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

统一声学特征与文本的多模态大语言模型用于神经退行性疾病筛查

Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出NeurMLLM框架,通过多模态大语言模型融合声谱图、MFCC和文本,实现阿尔茨海默病和帕金森病的精细分期,优于传统方法和现有LLM方法。

Comments IEEE International Conference on Healthcare Informatics, 2026

详情
AI中文摘要

基于语音的筛查为评估阿尔茨海默病(AD)和帕金森病(PD)等神经退行性疾病提供了一种可扩展且非侵入性的方式,但由于整合异质数据的困难,其分期仍然具有挑战性。本文提出了NeurMLLM,一种用于神经退行性疾病分期的高效多模态生成框架。NeurMLLM首先使用视觉变换器对音频数据的声谱图和梅尔频率倒谱系数进行编码,并将其表示投影到大语言模型(LLM)的嵌入空间中,在那里它们与转录文本和人口统计指令标记连接成一个统一的序列。然后,通过低秩适应使用任务提示对LLM进行指令微调,以自回归方式预测受限的标签标记,从而实现生成式分类。通过在Bridge2AI-Voice数据集上对AD和PD进行细粒度分期评估,我们观察到NeurMLLM取得了强劲的性能,持续优于经典机器学习方法和现有的基于LLM的方法。结果表明,多模态LLM在神经退行性疾病分期中具有巨大潜力,提高了分期准确性并支持可访问的部署。

英文摘要

Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.

2606.16731 2026-06-16 cs.SD cs.AI cs.HC 新提交

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

MuVAP: 面向野外对话轮次预测的多模态多方语音活动投影

Haotian Qi, Gabriel Skantze

发表机构 * Department of Speech Music and Hearing, KTH Stockholm, Sweden(瑞典皇家理工学院言语、音乐与听觉系)

AI总结 提出MuVAP框架,通过将声学预测锚定到面部轨迹,实现从单声道音频和单摄像头视角进行说话人感知的轮次预测,并引入角色相对投影和AVCC数据集解决多方建模和因果跟踪问题。

详情
AI中文摘要

当前的多方对话轮次模型通常依赖于复杂的麦克风阵列或多摄像头设置,限制了它们在人与机器人交互场景中的适用性。我们提出了MuVAP,这是一个因果多模态框架,通过将声学预测锚定到面部轨迹来扩展语音活动投影,从而能够从单声道音频流和单摄像头视角进行说话人感知的轮次预测。为了解决建模多个说话人的组合复杂性,我们提出了角色相对投影,它将任意N说话人交互映射到一个固定的当前与下一个话语持有者状态。由于现有的视听数据集包含破坏因果跟踪的剪辑切换,我们引入了视听对话语料库,这是一个31小时的未剪辑、单摄像头多方对话数据集。评估表明,MuVAP在两人和三人场景下的转换-保持和下一说话人预测任务中优于强基线。

英文摘要

Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.

2602.01394 2026-06-16 eess.AS cs.LG cs.SD 版本更新

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

SSNAPS: 基于扩散逆采样的语音与背景噪声视听分离

Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya

发表机构 * Bar-Ilan University(巴伊兰大学) OriginAI

AI总结 提出一种无监督的视听语音分离方法,利用扩散先验和逆采样联合建模语音与噪声,在单麦克风场景下优于有监督基线,并支持离屏说话人分离。

详情
AI中文摘要

本文解决了在真实环境噪声下进行视听单麦克风语音分离和增强的挑战。我们的方法基于生成逆采样,其中我们用专用的扩散先验对干净语音和环境噪声进行建模,并联合利用它们来恢复所有潜在源。为此,我们重新制定了一个最近的逆采样器以匹配我们的设置。我们在包含1、2和3个说话人以及噪声的混合信号上进行了评估,结果表明,尽管是完全无监督的,我们的方法在所有条件下的WER上始终优于领先的有监督基线。我们进一步扩展了我们的框架以处理离屏说话人分离。此外,分离出的噪声分量具有高保真度,使其适用于声学场景的下游检测。代码和预训练模型将在论文录用后提供。演示页面:https://ssnaps2026.github.io/ssnaps2026/

英文摘要

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream detection of the acoustic scene. Code and pretrained models will become available upon acceptance. Demo page: https://ssnaps2026.github.io/ssnaps2026/

9. 低资源、多语言与方言语音 3 篇

2606.16539 2026-06-16 eess.AS cs.SD 交叉投稿

Decoding while Adapting: Zero-Shot Online Speaker Adaptation via Audio-Textual Prompts for Elderly Speech Recognition

解码与自适应:基于音频-文本提示的零样本在线说话人自适应用于老年人语音识别

Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Tianzi Wang, Youjun Chen, Huimeng Wang, Haoning Xu, Jiajun Deng, Xunying Liu

发表机构 * The Chinese University of Hong Kong, Hong Kong SAR, China(香港中文大学) Institute of Software, Chinese Academy of Sciences, China(中国科学院软件研究所) National Research Council Canada, Canada(加拿大国家研究理事会)

AI总结 提出一种基于跨语句音频-文本提示的说话人自适应方法,实现零样本实时适应未见说话人,在老年人语音数据集上显著降低词错误率/字符错误率,并大幅提升实时因子。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

本文提出了一种新颖的基于跨语句音频-文本提示的说话人自适应方法,用于老年人语音识别。它能够对未见说话人进行零样本、实时自适应。从当前和前面几个语句中提取语音和文本嵌入,然后以跨模态方式融合,生成比i/x-向量和ECAPA-TDNN特征更一致的紧凑说话人提示。在英语DementiaBank Pitt和粤语JCCOCC MoCA老年人语音数据集上的实验表明,所提出的在线自适应相比说话人无关(SI)模型,在词错误率(WER)或字符错误率(CER)上取得了统计显著的降低,绝对降低分别为0.61%和1.22%(相对降低2.99%和4.48%)。与离线批处理自适应相比,实时因子(RTF)加速比高达9.83倍。

英文摘要

This paper proposes a novel cross-utterance audio-textual prompts based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utterances, before being fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model by statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
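The abstract quotes each reduction both in absolute and in relative terms, and the two are linked by the baseline error rate: relative = absolute / baseline. The sketch below makes that arithmetic explicit. The implied baselines it prints are back-of-envelope values derived from the rounded figures in the abstract, not numbers reported by the authors, and the function names are illustrative.

```python
def relative_reduction(baseline_rate: float, absolute_reduction: float) -> float:
    """Relative reduction in percent, given a baseline error rate and an absolute drop (both in %)."""
    return 100.0 * absolute_reduction / baseline_rate


def implied_baseline(absolute_reduction: float, relative_percent: float) -> float:
    """Baseline error rate (in %) implied by an absolute and a relative reduction."""
    return 100.0 * absolute_reduction / relative_percent


# Figures quoted in the abstract: (absolute %, relative %).
for absolute, relative in [(0.61, 2.99), (1.22, 4.48)]:
    base = implied_baseline(absolute, relative)
    print(f"absolute {absolute}% / relative {relative}% -> baseline about {base:.1f}%")
```

Under these rounded inputs the implied speaker-independent baselines come out at roughly 20.4% and 27.2%, which gives a sense of scale for the reported gains.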

2606.16546 2026-06-16 eess.AS cs.SD 交叉投稿

Confidence Score Guided Incremental and Speaker Adaptive Pseudo-Labeling for Semi-Supervised Elderly Speech Recognition

置信度评分引导的增量式和说话人自适应伪标注用于半监督老年人语音识别

Chengxi Deng, Xurong Xie, Shujie Hu, Jiajun Deng, Mengzhe Geng, Youjun Chen, Huimeng Wang, Haoning Xu, Guinan Li, Xunying Liu

发表机构 * The Chinese University of Hong Kong, Hong Kong SAR, China(香港中文大学) Institute of Software, Chinese Academy of Sciences, China(中国科学院软件研究所) National Research Council Canada, Canada(加拿大国家研究理事会)

AI总结 提出一种置信度评分引导的增量式和说话人自适应伪标注方法,通过渐进式高质量伪标签选择和说话人自适应训练,在老年人语音识别中分别降低词错误率1.45%和字符错误率2.27%。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

本文提出了一种新颖的置信度评分引导的增量式和说话人自适应伪标注方法,用于半监督老年人语音识别。该方法促进了更高质量的伪标签选择和渐进式优化,同时减轻了说话人异质性。设计了一个置信度估计模块来对未转录数据的可靠性进行排序,从而实现从高置信度到低置信度逐步引入未标记数据子集的课程学习轨迹。通过带有可学习提示的说话人自适应训练来捕获说话人特定特征。在英语DementiaBank Pitt和粤语JCCOCC MoCA老年人语音数据集上的实验表明,所提出的方法相比不使用置信度评分引导的增量式或说话人自适应伪标注的半监督基线,在词错误率(WER)或字符错误率(CER)上取得了统计显著的降低,绝对降低分别为1.45%和2.27%(相对降低6.21%和6.98%)。

英文摘要

This paper proposes a novel confidence score guided incremental and speaker adaptive pseudo-labeling approach for semi-supervised elderly speech recognition. It facilitates higher-quality pseudo-label selection and progressive refinement, while also mitigating speaker heterogeneity. A confidence estimation module is designed to rank the reliability of untranscribed data, enabling a curriculum learning trajectory that progressively folds in unlabeled data subsets from high to low confidence. Speaker-specific characteristics are captured through speaker adaptive training with learnable prompts. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed method outperforms a semi-supervised baseline that uses neither confidence-score-guided incremental nor speaker-adaptive pseudo-labeling, with statistically significant word error rate (WER) or character error rate (CER) reductions of 1.45% and 2.27% absolute (6.21% and 6.98% relative).
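The high-to-low confidence curriculum described in this abstract can be sketched as a ranking followed by cumulative staging. The code below is a minimal illustration under assumptions made here: confidence scores are given as a dictionary, stages are equal-sized, and each stage re-uses everything from earlier stages. The utterance IDs, scores, and function name are hypothetical, and the paper's actual schedule may differ.

```python
from typing import Dict, List


def curriculum_stages(confidences: Dict[str, float], num_stages: int) -> List[List[str]]:
    """Return cumulative training subsets, adding lower-confidence utterances stage by stage."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    stage_size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[: min(len(ranked), (k + 1) * stage_size)] for k in range(num_stages)]


# Hypothetical confidence scores for four untranscribed utterances.
scores = {"utt1": 0.95, "utt2": 0.40, "utt3": 0.80, "utt4": 0.65}
stages = curriculum_stages(scores, 2)
```

With two stages, the first round trains on the two most confident pseudo-labels (`utt1`, `utt3`) and the second round folds in the remaining, less reliable ones.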

2501.17615 2026-06-16 cs.CL cs.SD eess.AS 版本更新

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

跨语言嵌入聚类用于低资源多语言语音识别中的分层Softmax

Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu

AI总结 提出一种基于跨语言嵌入聚类构建分层Softmax解码器的方法,通过共享相似令牌表示提升低资源多语言语音识别精度。

Comments Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

详情
Journal ref
in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 4226-4238, 2025
AI中文摘要

我们提出了一种新颖的方法,聚焦于自动语音识别(ASR)的解码阶段,以增强多语言性能,特别是对于低资源语言。该方法利用跨语言嵌入聚类方法构建分层Softmax(H-Softmax)解码器,使得不同语言中的相似令牌能够共享相似的解码器表示。它解决了先前基于Huffman的H-Softmax方法的局限性,该方法在令牌相似性评估中依赖浅层特征。通过在15种语言的下采样数据集上的实验,我们证明了该方法在提高低资源多语言ASR准确性方面的有效性。

英文摘要

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
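A hierarchical Softmax factorises the token probability as P(token) = P(cluster) x P(token | cluster). The sketch below shows a two-level version with hand-picked clusters and logits; in the abstract the clusters come from cross-lingual embedding clustering, which is not reproduced here. All names and values are illustrative assumptions, and a real H-Softmax tree may have more than two levels.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_probs(cluster_logits: List[float],
                       token_logits: List[Dict[str, float]]) -> Dict[str, float]:
    """Two-level H-Softmax: P(token) = P(cluster) * P(token | cluster)."""
    cluster_p = softmax(cluster_logits)
    probs = {}
    for c, logits_in_cluster in enumerate(token_logits):
        within = softmax(list(logits_in_cluster.values()))
        for token, p in zip(logits_in_cluster, within):
            probs[token] = cluster_p[c] * p
    return probs


# Hypothetical clusters: similar tokens from different languages share cluster 0.
probs = hierarchical_probs([0.0, 0.0], [{"a_en": 0.0, "a_es": 0.0}, {"zh_shi": 0.0}])
```

Because each level is a proper distribution, the leaf probabilities still sum to one, while tokens in the same cluster share the cluster-level decision.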

10. 数据集、基准与评测 12 篇

2606.14784 2026-06-16 cs.SD cs.LG eess.AS 新提交

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

基于上下文学习的音频情感分类的LLM合成真实标签生成

Qing Huang, Pooja Pol, Jianing Zhang

发表机构 * School of Business, Technical University of Applied Sciences Augsburg(奥格斯堡应用技术大学商学院) Data Science und Autonome Systeme Technologietransferzentrum (TTZ)(数据科学与自主系统技术转移中心(TTZ))

AI总结 提出利用大语言模型(LLM)和上下文学习(ICL)从多用户VR环境的流式语音数据中自动生成情感相关合成真实标签,解决团队协作状态标注难题。

Comments Proceedings of the International Conference on Applied Innovations in IT (ICAIIT), April 2026

详情
AI中文摘要

理解人类状态和交互动态是人机交互(HCI)的核心目标。随着交互范式变得更加沉浸,虚拟现实(VR)已成为研究协作工作的强大平台。在此类环境中,评估团队协作状态(包括团队表现和团队韧性)需要从多模态传感器数据(如语音信号)中连续可靠地推断潜在的团队级认知和情感状态。然而,由于传感器噪声、上下文变异性和稀疏的专家标注,为这些潜在状态生成真实标签仍然具有挑战性。传统的自我报告方法仅提供静态和延迟的测量,因此不足以捕捉连续语音数据中反映的动态团队过程。在这项工作中,我们提出了一种由大语言模型(LLM)驱动的、基于代理的推理工作流,用于从多用户VR环境中的流式语音数据自动生成情感相关的合成真实标签。利用LLM的泛化能力,我们使用上下文学习(ICL)和少量配对的音频样本及其对应转录的演示。ICL倾向于实现与模型微调相当的任务适应,同时避免了参数更新的计算开销。为了构建信息丰富且鲁棒的上下文提示,我们采用基于检索的选择策略,根据声学特征空间中的相似性动态识别相关的音频演示。

英文摘要

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.
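The retrieval-based selection of in-context demonstrations mentioned at the end of this abstract can be sketched as a nearest-neighbour search in feature space. The code below ranks a pool of labelled clips by cosine similarity to a query clip and keeps the top k. The choice of cosine similarity, the two-dimensional feature vectors, and all names are assumptions for illustration; the abstract only says that selection is based on similarity in the acoustic feature space.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstrations(query: List[float], pool: Dict[str, List[float]], k: int) -> List[str]:
    """Return the names of the k pool items most similar to the query."""
    ranked = sorted(pool, key=lambda name: cosine(query, pool[name]), reverse=True)
    return ranked[:k]


# Hypothetical acoustic feature vectors for three labelled demonstration clips.
demo_pool = {"calm": [1.0, 0.0], "angry": [0.0, 1.0], "tense": [0.6, 0.8]}
selected = retrieve_demonstrations([0.1, 1.0], demo_pool, 2)
```

The selected clips, with their transcriptions and labels, would then be placed in the prompt as few-shot demonstrations ahead of the query.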

2606.14820 2026-06-16 cs.SD cs.AI cs.CL eess.AS 新提交

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

频谱-时间干扰混淆空间音频基础模型中的相位编码

Yuxuan Chen, Haoyuan Yu, Peize He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Jilin University(吉林大学) Hunan University(湖南大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出基于双耳掩蔽级差的心理声学基准,评估空间自监督音频模型对微秒级耳间相位精细结构的编码能力,发现通用双耳SSL模型依赖频谱-时间干扰纹理而非真实相位计算。

Comments Accepted to INTERSPEECH 2026; 6 pages, 3 figures

详情
AI中文摘要

最近的空间自监督音频模型在定位任务上取得了高性能,引发了对它们编码微秒级耳间相位精细结构能力的疑问。我们提出了一个基于双耳掩蔽级差的心理声学基准来评估这一点。使用均衡抵消基线和GCC-PHAT阳性对照,我们评估了九个冻结的音频模型,涵盖双耳SSL、单耳SSL和神经音频编解码器。四个单耳阴性对照产生零BMLD,确认了双耳特异性。两个通用双耳SSL模型表现出最小的相位敏感性,而专用双耳空间SSL模型实现了与分析基线相当的BMLD。渐进式物理消融实验表明,通用双耳SSL模型依赖于频谱-时间干扰纹理而非跨通道相位计算。语音中的高检测率反映了对宽带包络而非真实相位编码的混淆依赖。

英文摘要

Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.

2606.15751 2026-06-16 cs.SD cs.LG cs.MM eess.AS 新提交

Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models

通过阶段调制进行声学提示以实现音频语言模型中的少样本学习

Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出在音频编码器中引入可训练提示以捕获任务特定声学特征,与文本提示结合提升少样本适应性能,在11个数据集上验证有效性。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

音频-语言模型(ALMs)通过将音频波形与文本对齐,在零样本音频分类中取得了显著成功。最近改进下游性能的努力集中在学习最优文本提示上。然而,先前的方法侧重于文本编码器,忽略了音频编码器中可学习提示的潜力。在本文中,我们提出了一种新颖框架,将可训练提示引入音频编码器以捕获任务特定的声学特征。我们证明,将音频侧提示学习与现有文本侧方法相结合可以增强少样本适应。通过在11个数据集上的广泛实验表明,将我们的方法作为即插即用模块与现有文本提示调优相结合通常能带来性能提升。这些发现表明,显式调制音频表示空间可以有效补充仅文本提示方法。代码可在 https://github.com/hyebin-c/aspl 获取。

英文摘要

Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches focus on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at https://github.com/hyebin-c/aspl.

2606.15888 2026-06-16 cs.SD cs.AI eess.AS 新提交

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

NVMOS:语音中非语言发声质量评估

Jialong Mai, Jinxin Ji, Xiaofen Xing, Wencui Liu, Xiangmin Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对非语言发声(如笑声、叹息)的感知质量评估空白,构建NV-MOS数据集,提出首个专用模型NVMOS,通过局部聚焦模块达到专家级评估一致性。

Comments 6 pages. Code and model: https://github.com/yongaifadian1/NVMOS

详情
AI中文摘要

非语言发声(NVs),如笑声、叹息和咳嗽,是情感和意图的重要声学线索。现有的语音质量评估方法通常关注整体自然度,而非语言TTS评估主要检查目标NV是否以正确的类型和位置出现。然而,NV事件本身的感知质量仍未被充分探索。为填补这一空白,我们构建了一个NV-MOS数据集,包含来自多个NV-TTS系统的输出和自然发生的NV样本,并由三位声学专家根据感知质量量表进行评分。我们进一步分析了支持音频的多模态大语言模型(如Gemini),发现其评分与专家评分之间存在明显不一致。这些结果表明,通用多模态模型无法可靠地替代人类进行NV质量评估。随后,我们提出了NVMOS,据我们所知,这是第一个能够可靠预测语音中NV事件感知质量的模型。实验结果表明,通过局部NV事件聚焦模块,NVMOS达到了与人类MOS评分专家级或更强的一致性。

英文摘要

Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

2606.16327 2026-06-16 cs.SD cs.AI eess.AS 新提交

ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

ArtBoost: 用于声学到发音逆映射的合成发音数据增强

Hyung Kyu Kim, Byungchan Hwang, Hak Gu Kim

发表机构 * Anonymous 1(匿名机构1)

AI总结 提出ArtBoost数据增强策略,利用大规模语音-网格数据集提取伪发音轨迹进行预训练,在有限EMA数据下提升声学到发音逆映射性能,PCC和RMSE一致改善。

Comments Accepted in Interspeech26

详情
AI中文摘要

最近的声学到发音逆映射(AAI)模型依赖于电磁发音描记术(EMA)数据,这些数据成本高昂且规模有限。为了解决这一限制,我们提出了ArtBoost,一种新颖的数据增强策略,利用最初为语音驱动的3D面部动画开发的大规模语音-网格数据集,在有限的EMA监督下改进AAI。ArtBoost从可见的面部锚点提取伪发音轨迹,并在真实EMA数据上微调之前用于预训练。实验显示PCC和RMSE一致改善。轨迹分析证实伪发音信号反映了物理上有意义的可见发音动态。在不同AAI架构上的额外评估表明稳定的性能提升,表明ArtBoost可以集成到多种AAI模型中。这些结果表明语音-网格数据为AAI提供了一种有效且可扩展的发音监督来源。项目页面:https://cau-irislab.github.io/Interspeech26-ArtBoost/

英文摘要

Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech-mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech-mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: https://cau-irislab.github.io/Interspeech26-ArtBoost/

2606.16969 2026-06-16 cs.SD cs.AI eess.AS 新提交

Probing Low Frame Rate Degradation in Neural Audio Codecs

探测神经音频编解码器中的低帧率退化

Alex Gichamba, Moise Busogi

发表机构 * Carnegie Mellon University Africa(卡内基梅隆大学非洲校区)

AI总结 通过控制帧率消融实验,发现低帧率质量悬崖源于训练配置缺陷而非根本性障碍,修正后帧率可降至3.1Hz和1.6Hz。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

神经音频编解码器中的低帧率对于自回归语音合成具有吸引力,因为生成成本与序列长度线性相关。最近的研究表明,编解码器可以在12.5 Hz及以下运行,但低帧率退化的机制仍未被充分理解。我们通过受控的帧率消融实验来研究这些机制。我们重现了先前工作中报告的6.25 Hz处的质量悬崖,并评估了候选解释:音素冲突和码本饱和,两者均未显示出根本性障碍的证据。该悬崖实际上是由次优的训练配置引起的:训练期间固定的剪辑时长在低帧率下产生过少的令牌,使解码器缺乏令牌间上下文。一旦修正,WER随音素负载平滑退化,直至3.1 Hz和1.6 Hz,这表明低帧率编解码器的推理时效率增益比先前假设的更容易实现。

英文摘要

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
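
The mechanism described above is simple arithmetic: with a fixed training clip duration, the number of tokens per clip shrinks in proportion to the frame rate. The sketch below works through that arithmetic for the frame rates mentioned in the abstract. The 2-second clip duration and the 25-token target are hypothetical values chosen for illustration, and scaling the clip length is only one possible remedy, not necessarily the authors' correction.

```python
import math

# Frame rates named in the abstract, in Hz.
FRAME_RATES_HZ = (12.5, 6.25, 3.1, 1.6)


def tokens_per_clip(clip_seconds, frame_rate_hz):
    """Number of whole codec frames (tokens) that fit in one training clip."""
    return math.floor(clip_seconds * frame_rate_hz)


def clip_seconds_for(target_tokens, frame_rate_hz):
    """Clip duration needed to give the decoder `target_tokens` tokens of context."""
    return target_tokens / frame_rate_hz


CLIP_SECONDS = 2.0   # hypothetical fixed clip duration
TARGET_TOKENS = 25   # hypothetical context length to preserve

for rate in FRAME_RATES_HZ:
    fixed = tokens_per_clip(CLIP_SECONDS, rate)
    needed = clip_seconds_for(TARGET_TOKENS, rate)
    print(f"{rate:>5} Hz: {fixed:>2} tokens in a {CLIP_SECONDS:.0f} s clip; "
          f"{needed:.2f} s needed for {TARGET_TOKENS} tokens")
```

Under these assumed numbers a 2-second clip holds 25 tokens at 12.5 Hz but only 3 at 1.6 Hz, which illustrates how a decoder could be starved of inter-token context at low frame rates.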

2606.14791 2026-06-16 eess.AS cs.LG cs.SD 交叉投稿

From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation

从物理到表示:通过程序化生成进行合成预训练的音频学习

Fengrui Liu, Ruiyang Huang, Qijian Zheng, Yuanfang Wang, Feng Liu

发表机构 * East China Normal University(华东师范大学) Southeast University(东南大学) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出AudioPG框架,利用程序化合成生成波形进行掩码自编码器预训练,无需真实音频数据,在多个基准上取得高精度,且单GPU训练不到20分钟。

Comments Accepted to ACM ICMR 2026

详情
AI中文摘要

自监督学习推动了多媒体分析中音频表示的发展。然而,主流的数据驱动方法依赖大规模真实世界语料库,增加了训练成本、整理负担和隐私障碍。为解决这一问题,我们提出了AudioPG,一个程序化合成框架,在预训练过程中完全消除了真实音频录音。AudioPG在由基本声学基元和组合规则实时生成的波形上训练基于Transformer的掩码自编码器。该编码器有效迁移到真实音频基准,在ESC-50上达到90.60%的准确率,在FSD50K上达到0.546 mAP,在UrbanSound8K上达到88.17%,在Speech Commands V2上达到97.03%。值得注意的是,预训练在单个GPU上不到20分钟即可完成。潜在空间分析揭示了物理因素(包括基频和相对强度)在正交子空间中出现,使得表示可线性解码。这些结果表明,当大规模语料库不可用时,程序化合成是一种高效、可解释的预训练信号。我们的代码可在https://github.com/Freyliu0516/audioPG获取。

英文摘要

Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at: https://github.com/Freyliu0516/audioPG.

2606.15187 2026-06-16 eess.AS cs.SD 交叉投稿

VoxWatermark: A Large-Scale Benchmark for Audio Watermark Detection under Perturbations

VoxWatermark: 一个用于扰动下音频水印检测的大规模基准

Farnaz Sedaghati, Yuxi Wang, Zicheng Weng, Wei Rao

发表机构 * University of Tehran, Iran(伊朗德黑兰大学) Nanyang Technological University, Singapore(新加坡南洋理工大学)

AI总结 为解决缺乏统一基准的问题,构建VoxWatermark,包含10种水印方法和三种扰动类型,并提出鲁棒检测器AudioWMD,验证了其有效性和可扩展性。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

随着语音生成系统在开放环境中的快速部署,为音频内容提供可验证的来源归属和版权问责变得至关重要。当前研究的一个空白是缺乏一个统一的基准,能够在现实分布偏移下系统地比较不同的水印注入方法。为此,我们构建了VoxWatermark,通过在多语言、多源语料库上应用10种水印方法(4种神经方法和6种传统方法),并采用统一的注入和标注,同时引入无盒、黑盒和白盒扰动来模拟真实的录制和传输条件。基于该基准,我们提出了AudioWMD,作为大规模、多方法、跨分布设置下的鲁棒基线检测器。结果表明,注入方法的多样性和分布偏移会影响检测稳定性,同时验证了AudioWMD的有效性和可扩展性。数据集和代码已公开。

英文摘要

With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares different watermark injection methods under realistic distribution shifts. To address this, we build VoxWatermark by applying 10 watermarking methods (4 neural and 6 traditional) with unified injection and annotation on multilingual, multi-source corpora, and introducing no-box, black-box, and white-box perturbations to simulate real recording and transmission conditions. Based on this benchmark, we propose AudioWMD as a robust baseline detector for large-scale, multi-method, cross-distribution settings. Results show that injection-method diversity and distribution shifts affect detection stability, while validating the effectiveness and scalability of AudioWMD. Dataset and code are publicly available.

2606.15968 2026-06-16 eess.AS cs.SD 交叉投稿

Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages

弥合SEA差距:东南亚语言神经音频编码合成语音深度伪造的初步基准

Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru

发表机构 * IIIT-Delhi, India(印度德里英德拉普拉斯塔信息技术学院) UPES, India(印度UPES大学) VBSPU, India(印度VBSPU大学)

AI总结 针对东南亚语言神经音频编码合成语音深度伪造检测缺乏基准的问题,构建首个大规模基准SEA-CF,并提出轻量级小模型GARUDA,在低资源场景下实现高效检测。

Comments Accepted to IJCAI-ECAI 2026

详情
AI中文摘要

Codecfakes (CFs) 是通过音频语言模型 (ALMs) 生成的一种语音深度伪造,神经音频编码器 (NACs) 是其语音编码和生成的核心机制。CFs 表现出与基于声码器的深度伪造不同的分布特征,导致在声码器数据上训练的检测器难以泛化到 CF 检测。尽管这推动了 CF 检测基准的发展,但现有资源主要局限于英语(以及有限程度的中文),东南亚 (SEA) 语言尚未被探索。为弥合这一差距,我们引入了 SEA-CF,这是首个覆盖多种东南亚语言、多样说话人画像和多种 NAC 架构的大规模 CF 检测基准。SEA-CF 通过合成公开的真实语音语料库构建。实验表明,在英语中心数据集上训练的最先进 (SOTA) CF 检测器由于语言特定的语音结构、音调变化和丰富的韵律多样性,无法泛化到东南亚语音。我们进一步在 SEA-CF 上对近期 SOTA ALMs 进行了全面的零样本和微调评估。微调 ALMs 提升了性能,但这些模型规模非常大,在低资源和延迟受限的实际应用中不切实际。为解决这一限制,我们提出了一种专为 CF 检测定制的新型小 ALM——GARUDA,它在保持轻量级的同时实现了强劲性能。大量评估表明,所提出的小 ALM 优于强端到端和基于 ALM 的基线,为东南亚语言及其他语言的鲁棒 CF 检测建立了一个新的实用方向。

英文摘要

Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English, and to a limited extent Chinese, leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, these models are very large, which makes them impractical for real-world applications, particularly in low-resource and latency-constrained settings. To address this limitation, we propose GARUDA, a novel small ALM tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed small ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.

2606.16019 2026-06-16 cs.CL cs.LG cs.SD 交叉投稿

Scaling Human and G2P Supervision for Robust Phonetic Transcription

扩展人类与G2P监督以实现鲁棒语音转录

Alexander Metzger, Aruna Srivastava, Ruslan Mukhamedvaleev

发表机构 * Koel Labs LLC

AI总结 研究自动语音转录中人类标注与G2P监督的扩展规律,发现当人类标注少于20-30小时时G2P有效,超过后无益甚至降低鲁棒性,而ASR预训练可显著提升性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

专家语音标注成本高昂,尤其对于非标准方言和非典型语音。一种常见替代方法是使用字素到音素(G2P)模型从文本转录中自动生成语音标签。我们研究了自动语音转录性能如何随英语中人类和G2P监督的扩展而变化。使用一个涵盖母语、非母语和卒中后语音的精心策划的80小时基准测试,我们确定了一个监督质量阈值:只有当人类标注少于20-30小时时,G2P监督才有帮助。超过此阈值,它不提供显著益处,并可能降低跨方言鲁棒性。在此阈值之后有效的是ASR预训练,我们使用它实现了比先前系统加权音素特征错误率降低2.3倍,在非母语和失语症语音上取得了强劲提升。这些结果表明,数量驱动的G2P扩展可能对鲁棒泛化产生递减收益。

英文摘要

Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is to use Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native, and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What is effective beyond this threshold is ASR pretraining, which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

2509.16975 2026-06-16 cs.SD eess.AS 版本更新

Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs

基于多模态大语言模型的链式思维差异-共性推理的可解释音频编辑评估

Yuhang Jia, Xu Zhang, Yang Chen, Hui Wang, Enzhi Wang, Yong Qin

发表机构 * College of Computer Science, Nankai University, Tianjin, China(南开大学计算机科学学院,天津,中国) Academy for Advanced Interdisciplinary Studies, Nankai University, Tianjin, China(南开大学先进跨学科研究学院,天津,中国)

AI总结 提出首个基于自然语言的音频编辑自动评估框架,利用Qwen2-Audio和链式思维推理策略,实现可解释且与人类判断高度一致的评估。

详情
AI中文摘要

自动平均意见分(MOS)预测作为主观听音测试和客观指标的原则性替代方案,提供了可扩展且一致的音频评估。受LLM-as-Judge范式的启发,最近的多模态大语言模型展现出强大的感知建模和推理能力,能够进行音频质量评估。在这项工作中,我们解决了音频编辑评估这一具有挑战性的问题,并提出了首个基于自然语言的自动评估框架,该框架基于Qwen2-Audio。引入了两个基于字幕的微调任务以增强多音频理解,同时设计了一种链式思维提示策略,以鼓励结构化、逐步推理。实验表明,我们的框架产生可解释且逻辑一致的基于文本的评估,与人类判断高度一致,同时优于现有基线。代码和演示可在 https://github.com/NKU-HLT/Eval_Reasoning 获取。

英文摘要

Automatic mean opinion score (MOS) prediction serves as a principled alternative to both subjective listening tests and objective metrics, providing scalable and consistent audio evaluation. Inspired by the LLM-as-Judge paradigm, recent multimodal large language models offer strong perceptual modeling and reasoning capabilities, enabling audio quality assessment. In this work, we address the challenging problem of audio editing evaluation and propose the first natural language-based automated evaluation framework built upon Qwen2-Audio. Two caption-based fine-tuning tasks are introduced to enhance multi-audio understanding, together with a designed Chain-of-Thought prompting strategy to encourage structured, step-by-step reasoning. Experiments show that our framework produces interpretable and logically consistent text-based evaluations, aligning closely with human judgments while outperforming existing baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.

2604.16211 2026-06-16 cs.SD 版本更新

NVV-SuperBench: Beyond Words, Beyond Quality-Benchmarking Nonverbal Vocalizations in Speech Generation

NVV-SuperBench:超越语言,超越质量——语音生成中非语言发声的基准测试

Liumeng Xue, Weizhen Bian, Jiahao Pan, Wenxuan Wu, Yilin Ren, Boyi Kang, Jingbin Hu, Ziyang Ma, Shuai Wang, Xinyuan Qian, Hung-yi Lee, Yike Guo

发表机构 * Nanjing University(南京大学) The Hong Kong University of Science and Technology(香港科技大学) The Chinese University of Hong Kong(香港中文大学) University of Science and Technology Beijing(北京科技大学) Northwestern Polytechnical University(西北工业大学) Shanghai Jiao Tong University(上海交通大学) National Taiwan University(国立台湾大学)

AI总结 提出NVV-SuperBench基准,包含45类非语言发声的统一分类和多轴评估协议,用于测试语音生成系统在非语言发声控制、放置和感知显著性方面的能力,发现当前系统在低信噪比口部线索和长时情感发声方面存在瓶颈。

Comments Accepted as a long paper at INTERSPEECH 2026

详情
AI中文摘要

非语言发声(如笑、叹气、抽泣)对于类人语音至关重要,但标准化评估很少联合评估系统是否生成预期的非语言发声、正确放置它们并保持其显著性而不损害语音。我们提出了NVV-SuperBench,一个用于带非语言发声的语音生成的双语(英语/中文)基准。它提供了统一的45类分类法和超越传统语音质量评估的多轴协议,评估非语言发声特定的可控性、放置和感知显著性。我们对15个语音生成系统(涵盖基于提示和基于标签的控制范式)进行了基准测试,使用了客观指标、人类听力测试和基于LLM的多评估者评分。结果表明,非语言发声的可控性通常与语音质量解耦,而低信噪比口部线索和长时情感非语言发声仍然是瓶颈。NVV-SuperBench突出了当前的差距,并支持向更类人的语音生成迈进。

英文摘要

Nonverbal vocalizations (NVVs), such as laughing, sighing, and sobbing, are essential for human-like speech, yet standardized evaluation rarely jointly assesses whether systems generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present NVV-SuperBench, a bilingual English/Chinese benchmark for speech generation with NVVs. It provides a unified 45-type taxonomy and a multi-axis protocol beyond conventional speech quality assessment, evaluating NVV-specific controllability, placement, and perceptual salience. We benchmark 15 speech generation systems spanning prompt-based and tag-based control paradigms, using objective metrics, human listening tests, and LLM-based multi-rater evaluation. Results show that NVV controllability often decouples from speech quality, while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. NVV-SuperBench highlights current gaps and supports progress toward more human-like speech generation.

11. 安全、隐私与深度伪造音频 7 篇

2606.16532 2026-06-16 cs.SD cs.AI 新提交

Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection

双粒度正交解耦用于可泛化的音频深度伪造检测

Zhuodong Liu, Hugen Lv, Xiangyu Li, Chunhong Yuan

发表机构 * Beijing Jiaotong University(北京交通大学) Shanghai Jiao Tong University(上海交通大学) ITMO University(ITMO大学)

AI总结 针对音频深度伪造检测中隐式身份泄漏问题,提出双粒度正交解耦框架,通过样本级余弦正交性和批次级交叉协方差正则化强制特征独立,无需辅助网络或对抗训练,在多个数据集上取得更优等错误率。

Comments Accepted at Interspeech 2026, 6 pages, 3 figures

详情
AI中文摘要

音频深度伪造检测器常常无法跨说话人泛化,因为它们学习的是说话人身份特征而非合成伪影,这被称为隐式身份泄漏。现有方法解决了这一问题,但引入了架构复杂性或训练不稳定性。本文提出了一种双粒度正交解耦框架,在两个层次上强制特征独立性:样本级余弦正交性捕获方向去相关,而批次级交叉协方差正则化消除嵌入维度间的线性相关性。课程解耦调度逐步增强正交约束,无需辅助网络或对抗动态。在ASVspoof 2019 LA、ASVspoof 2021 DF和In-the-Wild数据集上的实验表明,所提方法分别实现了1.35%、7.88%和21.58%的等错误率(EER),在跨数据集迁移上比梯度反转解耦绝对提升了2.60%。

英文摘要

Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
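
The two granularities named in the abstract can be sketched as two penalty terms on a pair of embedding batches, one intended to carry synthesis artifacts and one intended to carry speaker identity. The sketch below is a plain-Python illustration under stated assumptions, not the paper's formulation: the squared-cosine form, the squared Frobenius norm of the cross-covariance, the linear warm-up schedule, and all function names are hypothetical.

```python
import math


def cosine_orthogonality_loss(spoof_batch, speaker_batch):
    """Sample-level term: mean squared cosine between paired embeddings."""
    total = 0.0
    for a, b in zip(spoof_batch, speaker_batch):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += (dot / (norm_a * norm_b + 1e-12)) ** 2
    return total / len(spoof_batch)


def cross_covariance_loss(spoof_batch, speaker_batch):
    """Batch-level term: squared Frobenius norm of the cross-covariance matrix."""
    n = len(spoof_batch)
    mean_a = [sum(col) / n for col in zip(*spoof_batch)]
    mean_b = [sum(col) / n for col in zip(*speaker_batch)]
    loss = 0.0
    for i in range(len(mean_a)):
        for j in range(len(mean_b)):
            cov = sum(
                (a[i] - mean_a[i]) * (b[j] - mean_b[j])
                for a, b in zip(spoof_batch, speaker_batch)
            ) / (n - 1)
            loss += cov * cov
    return loss


def curriculum_weight(step, warmup_steps, max_weight):
    """Assumed schedule: ramp the constraint weight linearly, then hold it."""
    return max_weight * min(1.0, step / warmup_steps)


# Toy batch: every pair is orthogonal, yet the dimensions co-vary across the batch.
spoof = [[1.0, 0.0], [0.0, 1.0]]
speaker = [[0.0, 1.0], [1.0, 0.0]]
print(cosine_orthogonality_loss(spoof, speaker))  # per-sample term
print(cross_covariance_loss(spoof, speaker))      # batch-level term
```

The toy batch shows why the two terms are complementary: each pair is exactly orthogonal, so the sample-level term vanishes, while the batch-level term is still non-zero because the embedding dimensions are linearly correlated across samples.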

2606.15117 2026-06-16 cs.MM cs.AI cs.CV cs.LG cs.SD 交叉投稿

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

用于集成视听视频深度伪造检测中领域适应的师生结构

Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

发表机构 * Department of Computer Engineering, Sharif University of Technology(谢里夫理工学院计算机工程系)

AI总结 提出EAV-DFD方法,结合师生框架的领域适应机制,提升模型在未见领域上的泛化能力,在三个数据集上AUC分别提升4.09%、17.94%和0.5%。

详情
AI中文摘要

生成式AI模型的快速发展导致了更逼真的深度伪造媒体,包括对音频、视频或两者的操纵。这引发了严重的隐私和社会问题。该领域的许多研究已经取得了有前景的域内结果;然而,这些模型在面对来自不同领域的数据时,其有效性常常下降。因此,最近的深度伪造检测方法侧重于通过多种技术增强泛化能力,这些技术融合了所有输入模态,包括音频、图像及其交互。为此,我们提出了EAV-DFD方法,一种广义的深度集成视听模型(EAV-DFD),结合了利用师生框架的领域适应机制,以增强模型在未见领域上的表现和泛化能力。为了评估模型性能,我们使用FakeAVCeleb数据集作为主领域,DFDC、Deepfake_TIMIT和PolyGlotFake数据集作为未见领域。我们的实验结果表明,所提出的框架在领域适应方面是有效的,仅使用一小部分未见数据集训练学生模型,就在三个未见数据集上分别将模型的AUC性能提升了4.09%、17.94%和0.5%。这产生了一种新颖的深度伪造检测模型,能够适应新领域并解释哪个模态被操纵,突显了我们的方法在现实世界应用中的潜力。

英文摘要

The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing generalization through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism that uses a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is effective in domain adaptation, improving the AUC of the model by 4.09%, 17.94%, and 0.5% on the three unseen datasets while using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

2606.15264 2026-06-16 eess.AS cs.SD 交叉投稿

DuraMark: Duration-Embedded Watermarking in LLM-based TTS

DuraMark: 基于LLM的文本转语音中的时长嵌入水印

Zhenwei Mou, Weili Jiang, Liping Chen, Zhen-Hua Ling, Kong Aik Lee, Kai Gao, Boyu Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Forensic Science, Ministry of Public Security(公安部刑侦科学研究所) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出DuraMark,一种基于音节时长编辑的信息级水印框架,利用可控时长的LLM-TTS模型嵌入水印,并使用时长提取器检测,有效抵抗生成式攻击。

Comments Accepted to INTERSPEECH 2026. 5 pages, 1 figure. Audio samples: https://muzw.github.io/duramark_demo/

详情
AI中文摘要

基于大语言模型(LLM)的文本转语音(TTS)模型已实现显著的语音克隆能力,引发了对深度伪造滥用的担忧。语音水印通过将可追溯信息嵌入生成的语音中来缓解这一问题。主流水印方法在信号级(波形或频谱图)操作,使得水印易受生成式攻击(如神经编解码器和声码器)的影响。为解决此问题,我们提出DuraMark,一种鲁棒的信息级水印框架。它利用音节时长编辑实现水印嵌入。具体而言,DuraMark集成了一个时长可控的基于LLM的TTS模型,在合成过程中编辑音节时长,并配以时长提取器提取这些时长用于检测。实验表明,DuraMark对生成式攻击具有优越的鲁棒性,显著优于信号级基线。音频样本可在https://muzw.github.io/duramark_demo/获取。

英文摘要

Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codec and vocoder). To address this, we propose DuraMark, a robust information-level watermarking framework. It utilizes syllable duration editing to achieve watermark embedding. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.

2606.15313 2026-06-16 eess.AS cs.SD 交叉投稿

DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization

DDPO-VC:基于扩散去噪策略优化的说话人去识别

Liming Wang, Cody Karjadi, Rhoda Au, James Glass

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出DDPO-VC框架,通过强化学习后训练扩散模型,利用隐私与效用教师奖励信号,在说话人去识别中平衡隐私保护与下游任务效用,优于多种强基线方法。

详情
AI中文摘要

说话人去识别的一个关键挑战是隐私与效用之间的平衡。许多效用变量,如说话人的认知健康状况,与隐私变量(如说话人身份)相关,违反了基于解耦方法所持有的独立性假设,导致私人信息泄露和下游任务有用信息丢失。为应对这一挑战,我们提出了一个通用框架DDPO-VC,通过基于强化学习的后训练与扩散模型实现说话人去识别。结合来自隐私导向和效用导向教师的奖励信号进行学习,我们的方法在两个常用的痴呆症语音基准上,在隐私保护和认知效用方面均优于各种强去识别方法。代码:https://github.com/cactuswiththoughts/DDPO-VC ;演示:https://cactuswiththoughts.github.io/SpeakerDeID-Demo/ 。

英文摘要

A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held by the disentanglement-based approaches, causing leakage of private information and the loss of useful information for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Code: https://github.com/cactuswiththoughts/DDPO-VC. Demo: https://cactuswiththoughts.github.io/SpeakerDeID-Demo/.

2606.15454 2026-06-16 eess.AS cs.SD 交叉投稿

Phonetically Explainable Speech Deepfake Detection

语音深度伪造检测的语音学可解释方法

Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen

发表机构 * University of Eastern Finland(东芬兰大学)

AI总结 提出音素引导的交叉注意力框架,将语音深度伪造检测转化为可解释的语音学过程,通过分解伪造后验概率实现每音素类别贡献的可视化,在多个数据集上验证了不同语音类别区分力的差异。

详情
AI中文摘要

语音深度伪造检测通常被视为一个不透明的分类任务,其中所有时间帧被平等地聚合。这忽略了不同语音类别携带的判别信息量差异巨大。为了解决这个问题,我们提出了一种音素引导的交叉注意力框架,将检测转化为一个可解释的、基于语音学的过程。我们将伪造后验概率 $P(\text{spoofed}\mid X, W)$ 分解,其中 $X$ 是声学表示,$W$ 是音素后验图。分解结果可写为 $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$,其中 $M$ 表示音素类别数,$P(\text{spoofed} \mid X, Z = z_i)$ 是给定 $X$ 时第 $i$ 个音素类别 $z_i$ 的伪造概率,每个 $w_i$ 是音素类别 $z_i$ 在话语中的出现率。我们的基于Transformer的架构通过一个交叉注意力块实例化这一点,其中音素查询选择性地探测声学键和值中的信息,softmax归一化池化提供显式的音素存在权重。与先前严重依赖事后可解释性方法的工作不同,我们的框架提供了设计上的语音可解释性。我们在LJSpeech衍生语料库、ASVspoof 2019 LA和ASVspoof 5 Track 1上评估了该框架。每音素重要性排名显示,判别力集中在生成模型难以忠实再现的发音类别上。塞音、擦音、塞擦音、鼻音和静音边界闭合排名最具判别力,而周期性元音和半元音排名较低。除了有竞争力的性能外,我们的模型提供了结构可解释性,产生可检查的每发音类别最终判决分解。

英文摘要

Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
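
The factorization in the abstract, $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, is a weighted sum, so each phonetic class contributes an inspectable term $w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$. The sketch below evaluates that sum for a toy utterance. The class names, weights, and probabilities are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
def per_class_contributions(weights, class_probs):
    """Return w_i * P(spoofed | X, Z = z_i) for every phonetic class z_i."""
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("phone-presence weights are expected to sum to 1")
    return {name: weights[name] * class_probs[name] for name in weights}


def utterance_spoof_probability(weights, class_probs):
    """Sum the per-class terms to get P(spoofed | X, W)."""
    return sum(per_class_contributions(weights, class_probs).values())


# Illustrative values only: w_i (prevalence of each class in the utterance)
# and the class-conditional spoofing probabilities.
weights = {"stop": 0.2, "fricative": 0.3, "vowel": 0.5}
class_probs = {"stop": 0.9, "fricative": 0.8, "vowel": 0.1}

contributions = per_class_contributions(weights, class_probs)
ranking = sorted(contributions, key=contributions.get, reverse=True)
print(utterance_spoof_probability(weights, class_probs))
print(ranking)
```

Sorting the per-class terms gives the kind of per-class breakdown of the final verdict that the abstract describes, here with classes ranked by how much each adds to the utterance-level score.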

2606.16837 2026-06-16 cs.CV cs.AI cs.SD 交叉投稿

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

基于时间金字塔建模的鲁棒语音伪造检测

Mahtab Masoudi Nezhad, Nima Karimian

发表机构 * Lane Department of Computer Science and Electrical Engineering, West Virginia University(西弗吉尼亚大学莱恩计算机科学与电气工程系) Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出时间金字塔适配器,通过多尺度时间卷积捕获局部伪影和全局韵律异常,结合自监督XLS-R表示,在多个数据集上显著优于基线模型。

详情
AI中文摘要

伪造语音检测日益受到逼真合成、语音转换和重放攻击的挑战,跨数据集泛化仍然是主要限制。本文提出时间金字塔适配器,利用具有不同感受野的并行时间卷积来捕获多尺度伪造线索,从局部伪影到全局韵律异常。我们还集成了自监督XLS-R表示,并结合前端适配器,包括Mel、Sinc和用于多尺度时间建模的时间金字塔设计。所提出的模型在多个基准上进行了评估,包括ASVspoof 2017、ASVspoof 2021 (DF/LA)、PartialSpoof、DiffSSD和多语言HQ-MPSD数据集。实验结果表明,时间金字塔模型在PartialSpoof数据库上获得了99.24%的AUC和3.87%的EER,显著优于基础模型和多个SOTA基线,如LCNN-BLSTM(9.87% EER)和TRACE(8.08% EER)。此外,多语言评估证实伪造伪影与语言无关;自监督表示虽然提高了鲁棒性,但在领域和语言偏移下性能仍会下降,凸显了需要更好的适应和校准策略。

英文摘要

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. We propose a Temporal Pyramid Adapter that uses parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including the ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several SOTA baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
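
The core idea of parallel temporal convolutions with different receptive fields can be shown on a one-dimensional feature track. The sketch below is only an illustration of that idea, not the paper's adapter: it uses fixed moving-average kernels in place of learned filters, and the kernel sizes (3, 7, 15) and function names are assumptions.

```python
def conv1d_same(signal, kernel):
    """1-D correlation with zero padding; output length equals input length.

    Assumes an odd kernel length so the window is centred on each frame.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(signal))
    ]


def temporal_pyramid(signal, kernel_sizes=(3, 7, 15)):
    """Run parallel branches with growing receptive fields and stack them per frame."""
    branches = []
    for size in kernel_sizes:
        kernel = [1.0 / size] * size  # stand-in for a learned temporal filter
        branches.append(conv1d_same(signal, kernel))
    # One feature vector per frame, with one entry per branch.
    return [list(frame) for frame in zip(*branches)]


# A single impulse at frame 10 stands in for a short, local artifact.
impulse = [0.0] * 10 + [1.0] + [0.0] * 10
features = temporal_pyramid(impulse)
print(features[10])  # all branches respond at the artifact itself
print(features[14])  # four frames away, only the widest branch still responds
```

The narrow branch reacts only in the immediate neighbourhood of the impulse, while the wide branch spreads the same event over a longer context, which is the sense in which a pyramid of receptive fields covers cues from local to more global.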

2512.11241 2026-06-16 cs.SD 版本更新

The Affective Bridge: Preserving Speech Representations while Enhancing Deepfake Detection via Emotional Constraints

情感桥梁:通过情感约束在增强深度伪造检测的同时保留语音表征

Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

发表机构 * University of Bristol(布里斯托大学) University of Edinburgh(爱丁堡大学) University of Technology Sydney(悉尼科技大学)

AI总结 提出仅用情感识别微调语音编码器,再训练轻量SVM进行深度伪造检测,既保留下游任务表征能力又提升检测性能,发现情感是独特的桥梁任务。

Comments Submitted to SLT 2026 for review

详情
AI中文摘要

语音深度伪造检测(DFD)受益于多种声学和语义语音表征,其中许多编码了有价值的语音信息且训练成本高昂。先前工作表明情感线索可改善DFD,但现有方法要么在复杂流程中将情感与其他任务特定特征融合,要么直接针对DFD目标微调表征,这可能损害支持下游任务(如说话人验证SV或自动语音识别ASR)的原始语音表征。我们提出一种更简单的方法:仅用情感识别微调语音编码器,无需任何DFD监督,并在冻结的情感调优表征上训练轻量支持向量机(SVM)用于DFD。这保留了原始表征对下游任务(如SV和ASR)的能力,同时涌现式地提升了DFD性能。关键的是,我们发现情感作为此桥梁任务具有独特有效性:用说话人身份替代情感甚至会降低DFD性能,表明收益源于情感作为语音表征与DFD之间自然桥梁的作用。在FakeOrReal和In-the-Wild上的实验显示准确率提升高达6%和2%,对应EER降低,而对ASVspoof 2019 LA的分析揭示了真实语音子集中的数据集特定说话人偏差。代码见补充材料。

英文摘要

Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Prior work has shown that affective cues improve DFD, yet existing approaches either fuse emotion with other task-specific features in complex pipelines or directly fine-tune representations toward DFD objectives, risking distortion of the original speech representations that support downstream tasks such as speaker verification (SV) or automatic speech recognition (ASR). We propose a simpler approach: fine-tuning speech encoders on emotion recognition alone, without any DFD supervision, and training a lightweight support vector machine (SVM) on the frozen emotion-tuned representations for DFD. This preserves the original representation capacity for downstream tasks such as SV and ASR, while emergently improving DFD performance. Crucially, we find that emotion is uniquely effective as this bridging task: replacing it with speaker identity even degrades DFD performance, demonstrating that the benefit stems from emotion's role as a natural bridge between speech representation and DFD. Experiments on FakeOrReal and In-the-Wild show accuracy improvements of up to 6% and 2% with corresponding EER reductions, while analysis on ASVspoof 2019 LA reveals dataset-specific speaker bias in the real-speech subset. Code is available in the supplementary materials.

12. 其他/综合语音音频 11 篇

2606.16505 2026-06-16 cs.SD cs.LG 新提交

Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

半监督语音自信度检测:使用伪标签和Whisper嵌入

Adam Wynn, Jingyun Wang, Xiangyu Tan

发表机构 * Durham University(杜伦大学) Shanghai Open University(上海开放大学)

AI总结 提出一种结合人工特征与Whisper嵌入的框架,通过伪标签技术扩充数据,利用共注意力机制融合特征,实现75%的语音自信度检测准确率。

Comments 8 pages, 3 figures. Published in the Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025). Shorter, preliminary version of arXiv:2605.12387

详情
Journal ref
AIED 2025. LNCS vol 15882. Springer, Cham (2025)
AI中文摘要

理解说话者的自信度在教育环境中至关重要,因为它可以增强个性化反馈并改善学习成果。本研究引入了一种新颖的框架,通过将人工设计的特征与Whisper编码器的嵌入相结合来检测说话者的自信度。为了解决数据限制问题,采用伪标签技术来扩展标记数据集,使模型能够从人工标注和模型生成的标签中学习。该框架将传统语音特征(包括音高、音量、语速以及不流畅和重音的存在)与Whisper嵌入相结合,并使用共注意力机制融合这些表示,实现了75%的整体准确率。本研究有助于推进语音分析,支持个性化学习和口语技能发展的应用。

英文摘要

Understanding speaker confidence is crucial in educational settings, as it can enhance personalised feedback and improve learning outcomes. This study introduces a novel framework for detecting speaker confidence by integrating human-engineered features with embeddings from the Whisper encoder. To address data limitations, a pseudo-labelling technique is employed to expand the labelled dataset, allowing the model to learn from both human-annotated and model-generated labels. The framework combines traditional speech features including pitch, volume, rate of speech, and the presence of disfluencies and stress, with Whisper embeddings, and uses a co-attention mechanism to fuse these representations and achieve an overall accuracy of 75%. This study contributes to advancing speech analysis, enabling applications that support personalised learning and speaking skill development.
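The pseudo-labelling step can be sketched as follows. This is a minimal illustration, assuming a toy nearest-centroid classifier and a hypothetical confidence threshold of 0.8; the abstract does not specify the classifier, the threshold, or the feature dimensionality, and the actual framework uses Whisper embeddings fused with hand-crafted features through co-attention.

```python
import math


def fit_centroids(samples):
    """samples: list of (feature_vector, label). Returns {label: mean vector}."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_with_confidence(centroids, x):
    """Softmax over negative distances gives a crude confidence score."""
    labels = sorted(centroids)
    scores = [-math.dist(x, centroids[y]) for y in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    best = max(range(len(labels)), key=lambda i: exps[i])
    return labels[best], exps[best] / total


def pseudo_label(labelled, unlabelled, threshold=0.8):
    """Label the unlabelled pool, keep confident predictions, and retrain on the union."""
    centroids = fit_centroids(labelled)
    accepted = []
    for x in unlabelled:
        label, conf = predict_with_confidence(centroids, x)
        if conf >= threshold:  # hypothetical threshold, not taken from the paper
            accepted.append((x, label))
    # Retrain on human labels plus the confident pseudo-labels.
    return fit_centroids(labelled + accepted), accepted
```

Samples that fall between classes stay unlabelled, so only predictions the current model is fairly sure about enlarge the training set.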

2606.15141 2026-06-16 eess.AS cs.AI cs.SD 交叉投稿

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

EChO-Agent: 用于音频推理的证据链编排智能体

Siyuan Zhang, Jian Zong, Junyu Wang, Peiyuan Jiang, Jiahao Yan, Jingyu Zhang, Tianrui Wang, Xiaobao Wang, Longbiao Wang, Jianwu Dang

发表机构 * School of Artificial Intelligence, Tianjin University(天津大学人工智能学院)

AI总结 提出EChO-Agent模块化框架,将复杂音频问答转化为规划、工具执行、证据整合和答案验证流程,在MMAR基准上提升准确率和评分。

Comments 5 pages, 2 figures. Accepted by Interspeech 2026

详情
AI中文摘要

虽然LALMs在音频问答上展现出潜力,但在处理复杂音频推理时,它们未能聚焦于问题相关的音频片段,也无法提供清晰、可检查的推理过程。强化学习和工具增强提示可以帮助模型更好地将问题与音频关联起来,但缺乏可靠的方式来理解、整合和自验证音频片段。为弥补这一不足,我们提出了EChO-Agent,一个模块化智能体框架,将复杂的音频问答重新表述为规划、工具执行、证据整合和答案验证的工作流程。在MMAR基准上的实验表明,EChO-Agent在准确率和评分上均优于基线,消融研究显示证据整合是关键因素。

英文摘要

While LALMs show promise on audio question answering, they fail both to focus on question-relevant audio segments and to provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.

2606.15638 2026-06-16 eess.AS cs.SD 交叉投稿

MambAdapter: Lightweight Mamba-Based Adapters for Parameter-Efficient Transfer Learning in Speech and Audio

MambAdapter:基于轻量级Mamba适配器的语音和音频参数高效迁移学习

Salman Hussain Ali, Umberto Cappellazzo, Mirco Ravanelli

发表机构 * Université de Montréal(蒙特利尔大学) Imperial College London(伦敦帝国理工学院) Concordia University(康科迪亚大学) Mila – Quebec AI Institute(魁北克AI研究院)

AI总结 提出MambAdapter方法,将Mamba状态空间模型集成到低秩瓶颈适配器中,通过参数共享和轻量级Mamba模块实现音频特征的高效建模,在四个音频分类和五个语音识别任务上匹配或超越现有PETL方法。

Comments Accepted to Interspeech 2026. Code available at: https://github.com/salman-ha/MambAdapter

详情
AI中文摘要

微调基于Transformer的基础模型已成为音频和语音处理领域域适应的主导策略。为了降低这一过程的计算和内存成本,参数高效迁移学习(PETL)方法已被广泛探索。与此同时,最近的状态空间模型Mamba作为Transformer在序列建模中的有前途的替代方案出现。在这项工作中,我们提出了MambAdapter,一种将Mamba集成到低秩瓶颈适配器中的参数高效迁移学习方法。我们的设计结合了跨适配器的参数共享和轻量级Mamba模块的注入,从而能够更有效地建模音频特征。我们证明,即使在减少参数预算的情况下,MambAdapter在四个音频分类任务和五个语音识别语言上也能匹配或超越强大的PETL基线。

英文摘要

Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Meanwhile, Mamba, a recent state-space model, has emerged as a promising alternative to Transformers for sequence modeling. In this work, we present MambAdapter, a parameter-efficient transfer learning approach that integrates Mamba into low-rank bottleneck adapters. Our design combines parameter sharing across adapters with the injection of a lightweight Mamba module, enabling more effective modeling of audio features. We demonstrate that MambAdapter matches or outperforms strong PETL baselines on four audio classification tasks and five speech recognition languages, even when operating under reduced parameter budgets.
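For readers unfamiliar with low-rank bottleneck adapters, the sketch below shows the generic building block in plain Python: a down-projection to a small rank, a nonlinearity, an up-projection, and a residual connection. It deliberately omits the Mamba module, and the class name, the ReLU nonlinearity, and the zero-initialised up-projection are assumptions for illustration rather than details taken from the paper.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


class BottleneckAdapter:
    """Computes h + W_up * relu(W_down * h), with W_down of shape r x d and W_up of shape d x r."""

    def __init__(self, d_model, rank, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(rank)]
        # Zero-initialised up-projection (assumption): the adapter starts as the identity map.
        self.w_up = [[0.0] * rank for _ in range(d_model)]

    def __call__(self, h):
        z = relu(matvec(self.w_down, h))
        delta = matvec(self.w_up, z)
        return [a + b for a, b in zip(h, delta)]

    def num_parameters(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


if __name__ == "__main__":
    # Reusing one adapter object in every layer is one simple form of parameter sharing.
    shared = BottleneckAdapter(d_model=768, rank=8)
    print(shared.num_parameters())
```

With a hidden size of 768 and rank 8 the adapter holds 2 x 768 x 8 = 12,288 weights, compared with 589,824 for a full 768 x 768 projection, which is where the parameter savings come from.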

2510.07442 2026-06-16 cs.SD 版本更新

INFER: Learning Implicit Neural Frequency Response Fields for Confined Car Cabin

INFER:学习受限汽车座舱的隐式神经频率响应场

Harshvardhan C. Takawale, Nirupam Roy, Phil Brown

发表机构 * Harvard University(哈佛大学) University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出INFER框架,通过频域神经隐式模型联合学习源和接收器位置、方向下的复值频率响应场,引入端到端前向模型、感知谱监督和Kramers-Kronig约束,在汽车座舱数据上显著降低幅度和相位重建误差。

详情
AI中文摘要

精确的空间声学建模对于在受限、共振环境(如汽车座舱)中实现沉浸式和清晰音频至关重要。当前的调校方法依赖手动、硬件密集且静态的方式,无法考虑频率选择性行为以及乘客存在或座椅调整等动态变化。为解决此问题,我们提出INFER:隐式神经频率响应场,一种频域神经框架,联合条件于源和接收器位置、方向,直接学习受限共振环境(如汽车座舱)内的复值频率响应场。我们在当前神经声学建模方法基础上引入三项关键创新:(1)新颖的端到端频域前向模型,直接学习3D空间中的频率响应场和频率特定衰减;(2)感知和硬件感知的谱监督,强调关键听觉频段并弱化不稳定交叉区域;(3)基于物理的Kramers-Kronig一致性约束,正则化频率相关衰减和延迟。我们在多个汽车座舱中收集的真实数据上评估了该方法。我们的方法在模拟和真实汽车数据集上显著优于时域和混合域基线,平均幅度和相位重建误差分别降低超过39%和51%。INFER为汽车空间中的神经声学建模设立了新的最先进水平。

英文摘要

Accurate modeling of spatial acoustics is critical for immersive and intelligible audio in confined, resonant environments such as car cabins. Current tuning methods are manual, hardware-intensive, and static, failing to account for frequency-selective behaviors and dynamic changes like passenger presence or seat adjustments. To address this issue, we propose INFER: Implicit Neural Frequency Response fields, a frequency-domain neural framework that is jointly conditioned on source and receiver positions and orientations to directly learn complex-valued frequency response fields inside confined, resonant environments like car cabins. We introduce three key innovations over current neural acoustic modeling methods: (1) a novel end-to-end frequency-domain forward model that directly learns the frequency response field and frequency-specific attenuation in 3D space; (2) perceptual and hardware-aware spectral supervision that emphasizes critical auditory frequency bands and deemphasizes unstable crossover regions; and (3) a physics-based Kramers-Kronig consistency constraint that regularizes frequency-dependent attenuation and delay. We evaluate our method over real-world data collected in multiple car cabins. Our approach significantly outperforms time- and hybrid-domain baselines on both simulated and real-world automotive datasets, cutting average magnitude and phase reconstruction errors by over 39% and 51%, respectively. INFER sets a new state-of-the-art for neural acoustic modeling in automotive spaces.
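As background for the Kramers-Kronig constraint mentioned in item (3), the textbook relations are given below. They tie together the real and imaginary parts of the frequency response of a causal, stable linear system. The symbol names are chosen here for illustration, the signs depend on the Fourier-transform convention, and the abstract does not say how the paper turns these relations into a training loss.

```latex
% H(\omega) = H'(\omega) + i\,H''(\omega): frequency response of a causal, stable linear system
% \mathcal{P} denotes the Cauchy principal value.
\begin{align}
H'(\omega)  &= \frac{1}{\pi}\,\mathcal{P}\!\int_{-\infty}^{\infty} \frac{H''(\omega')}{\omega' - \omega}\,d\omega' \\
H''(\omega) &= -\frac{1}{\pi}\,\mathcal{P}\!\int_{-\infty}^{\infty} \frac{H'(\omega')}{\omega' - \omega}\,d\omega'
\end{align}
```

Because each part determines the other, a model that predicts attenuation and delay independently can be penalised whenever the two predictions violate these relations.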

2512.15313 2026-06-16 cs.SD cs.LG 版本更新

Time-Varying Audio Effect Modeling by End-to-End Adversarial Training

通过端到端对抗训练进行时变音频效果建模

Yann Bourdin, Pierrick Legrand, Fanny Roche

发表机构 * Arturia Inria center at the University of Bordeaux(Inria中心,位于波尔多大学)

AI总结 提出一种生成对抗网络框架,仅用输入输出音频记录建模时变音频效果,无需调制信号提取,通过两阶段训练策略和状态预测网络实现黑箱建模。

Comments (03/2026) Accepted to the Journal of the Audio Engineering Society (JAES). Accompanying website: https://ybourdin.github.io/sptvmod

详情
AI中文摘要

深度学习已成为音频效果建模的标准方法,但严格的黑箱建模对于时变系统仍然存在问题。与时不变效果不同,在具有内部调制的设备上训练模型通常需要记录或提取控制信号,以确保标准损失函数所需的时间对齐。本文介绍了一种生成对抗网络(GAN)框架,仅使用输入输出音频记录来建模此类效果,无需调制信号提取。我们提出了一种卷积循环架构,通过两阶段策略进行训练:初始对抗阶段允许模型在没有严格相位约束的情况下学习调制行为的分布,随后是监督微调阶段,其中状态预测网络(SPN)估计所需的初始内部状态,以使模型与目标同步。此外,开发了一种基于啁啾序列信号的新指标来量化调制精度。对复古硬件移相器的建模实验证明了该方法在完全黑箱上下文中捕获时变动态的能力。

英文摘要

Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, without requiring a modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box context.

2506.21613 2026-06-16 cs.CL cs.SD eess.AS 版本更新

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

ChildGuard:用于打击针对儿童的仇恨言论的专用数据集

Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem

发表机构 * Macquarie University(麦考瑞大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) DSEU(德里技能与创业大学) Department of Information and Computer Science, KFUPM(法赫德国王石油与矿产大学信息与计算机科学系) Stanford University(斯坦福大学)

AI总结 针对社交媒体上针对儿童的仇恨言论问题,构建了大规模英文数据集ChildGuard,包含351,877条标注实例,覆盖三个年龄段,并评估了多种模型性能。

Comments Updated Version

详情
AI中文摘要

心理健康行业越来越关注社交媒体上针对儿童的仇恨言论,因为接触此类内容可能在关键发育阶段导致不良心理结果。当前的仇恨言论数据集和检测系统对儿童应用的支持有限,因为它们主要针对成人设计,缺乏针对儿童仇恨言论的年龄特定特征的专门表示。为了解决这一差距,我们引入了ChildGuard,一个大规模英文数据集,用于针对儿童的仇恨言论,包含从X(原Twitter)、Reddit和YouTube收集的351,877条标注实例。该数据集覆盖三个年龄组:低龄儿童(11岁以下)、前青少年(11-12岁)和青少年(13-17岁)。ChildGuard包含两个子集:上下文子集(157K)和词汇子集(194K)。使用最新的基于Transformer的模型和LLM进行评估,最佳Macro-F1达到82.07%,在低龄儿童、上下文、隐式仇恨和跨子集设置下分别降至79.41%、79.24%、76.04%和74.88%。

英文摘要

The mental health industry faces growing concerns regarding hate speech directed at children on social media, as exposure to such content can contribute to adverse psychological outcomes during critical stages of development. Current hate speech datasets and detection systems provide limited support for child-focused applications because they are primarily designed for adults and lack dedicated representations of age-specific characteristics associated with hate speech directed at children. To address this gap, we introduce ChildGuard, a large-scale English dataset for child-targeted hate speech containing 351,877 annotated instances collected from X (formerly Twitter), Reddit, and YouTube. The dataset covers three age groups: younger children (under 11), pre-teens (11-12), and teens (13-17). ChildGuard contains two subsets: a contextual subset (157K) and a lexical subset (194K). Evaluation using recent transformer-based models and LLMs achieves a best Macro-F1 of 82.07%, decreasing to 79.41%, 79.24%, 76.04%, and 74.88% on younger children, contextual, implicit hate, and cross-subset settings, respectively.
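Macro-F1, the metric reported above, averages per-class F1 scores with equal weight, so a minority class such as hate speech counts as much as the majority class. The sketch below computes it in plain Python; the function name and the toy labels in the usage example are illustrative and not drawn from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    y_true = ["hate", "hate", "none", "none", "none", "none"]
    y_pred = ["hate", "none", "none", "none", "none", "hate"]
    print(macro_f1(y_true, y_pred))  # per-class F1 is 0.5 (hate) and 0.75 (none)
```

In this toy case accuracy is 4/6, about 0.667, while Macro-F1 is 0.625, because the weaker score on the smaller class pulls the average down.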

2605.01101 2026-06-16 cs.AI cs.CL cs.SD eess.AS 版本更新

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

虚拟言语治疗师:一种临床医生参与的AI言语治疗代理,用于个性化和监督式治疗

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

发表机构 * The Kashmir Hub for Artificial Intelligence(克什米尔人工智能中心) Microsoft / Vocametrix(微软 / Vocametrix) IAI, TCG CREST(IAI,TCG CREST) Université de Lorraine, CNRS, Inria, LORIA(洛林大学,CNRS,Inria,LORIA) Laboratoire Praxiling, UMR5267, CNRS et Université Paul-Valéry Montpellier 3(Praxiling实验室,UMR5267,CNRS及蒙彼利埃Paul-Valéry大学) Speechcare iStutter, Portuguese Catholic University(Speechcare iStutter,葡萄牙天主教大学) CHI – Chair of Health Informatics, TUM University Hospital(健康信息学系,TUM大学医院) GLAM – Group on Language, Audio, & Music, Imperial College London(语言、音频与音乐小组,伦敦帝国理工学院)

AI总结 提出虚拟言语治疗师(VST)平台,集成深度学习口吃分类与多智能体大语言模型推理,自动生成个性化治疗方案,并通过临床医生反馈优化,实验证明其高质量推荐。

Comments Under Review

详情
AI中文摘要

本文开发了虚拟言语治疗师(VST),这是一个基于智能体的平台,通过自动化和自适应的AI驱动工作流程,简化口吃评估并提供定制化的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多智能体大语言模型(LLM)推理,以支持循证临床决策。VST首先获取并提取患者语音样本的特征,然后对口吃类型进行稳健分类。基于这些输出,VST启动一个智能体推理过程,其中专门的LLM智能体自主生成、批评并迭代优化个性化治疗计划。一个专门的批评智能体评估所有生成的治疗计划,以确保临床安全性、方法学合理性,并与同行评审的证据和既定专业指南保持一致。最终输出是一个全面的、针对患者的治疗草案,供临床医生审查。系统结合临床医生的反馈,生成最终的治疗计划,适用于患者交付,从而保持临床医生参与的范式。由专家言语治疗师进行的实验评估证实,VST持续生成高质量、基于证据的治疗建议。这些发现表明该系统具有增强临床工作流程、减轻临床医生负担并改善言语障碍患者治疗效果的潜力。所提出系统的交互式用户界面可在以下网址在线获取:https://vocametrix.com/ai/stuttering-therapy-planning-agent ,支持实时口吃评估和个性化治疗计划。

英文摘要

This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at https://vocametrix.com/ai/stuttering-therapy-planning-agent and facilitates real-time stuttering assessment and personalized therapy planning.

2506.00955 2026-06-16 cs.CL cs.SD eess.AS 版本更新

Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

利用大语言模型为讽刺检测标注讽刺语音

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

发表机构 * University of Groningen(格罗宁根大学) Speech Technology Lab(语音技术实验室) Center for Language and Cognition(语言与认知中心)

AI总结 本文提出利用大语言模型生成讽刺语音数据集,通过人类验证提升标注质量,并在公开数据集上验证了检测性能,最终引入PodSarc数据集,实现了73.63%的F1分数。

Comments Interspeech 2025; Project page: https://github.com/Abel1802/PodSarc

详情
AI中文摘要

讽刺通过语气和语境改变意义,但语音中检测讽刺仍具挑战性,因数据稀缺。现有检测系统常依赖多模态数据,限制了仅语音可用场景的应用。为此,我们提出一个利用大语言模型(LLMs)生成讽刺数据集的标注流程。使用公开的以讽刺为主的播客,我们采用GPT-4o和LLaMA 3进行初始讽刺标注,随后由人类验证以解决分歧。我们通过在公开讽刺数据集上比较标注质量和检测性能,验证了该方法的有效性。最后,我们引入PodSarc,一个通过此流程生成的大规模讽刺语音数据集。检测模型实现了73.63%的F1分数,证明了该数据集作为讽刺检测研究基准的潜力。

英文摘要

Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset's potential as a benchmark for sarcasm detection research.

2601.03612 2026-06-16 cs.LG cs.SD eess.AS 版本更新

Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

通过结构归纳偏差的多声部音乐生成的数学基础

Joonwon Seo

发表机构 * GitHub

AI总结 本文通过结构归纳偏差提出多声部音乐生成的数学框架,采用贝多芬钢琴奏鸣曲案例,引入Smart Embedding架构,减少参数并提升模型稳定性。

Comments 86 pages. A comprehensive monograph on the Smart Embedding architecture for polyphonic music generation. Includes rigorous theoretical proofs using Information Theory, Rademacher Complexity, and the Rank-Preserving Transversality Property (RPTP), along with empirical validation and a human listening study (N=53)

详情
AI中文摘要

本文通过结构归纳偏差解决AI音乐生成中的"缺失中间"问题,即产生连贯的、乐句级音乐结构的挑战。以贝多芬的钢琴奏鸣曲为例,引入Smart Embedding架构,一种基于经验证实的音高和手部属性独立性(NMI=0.167)的因子化表示。该架构在减少嵌入参数48.3%的同时,将验证损失降低了9.47%。理论层面,通过信息论、Rademacher复杂度分析(得出28.09%更紧的泛化界限)和范畴论解释建立正式保证。这些结果进一步通过奇异值分解分析和专家盲听研究(N=53)得到支持。总体而言,本文结合了架构创新与数学严谨性,为复杂序列数据生成模型提供了原则性的框架,使其更加高效、稳定和可解释。

英文摘要

This monograph addresses the "Missing Middle" problem in AI music generation: the challenge of producing coherent, phrase-level musical structure. Using Beethoven's piano sonatas as a case study, I introduce the Smart Embedding architecture, a factorized representation grounded in the empirically verified independence of pitch and hand attributes (NMI=0.167). The architecture achieves a 48.3% reduction in embedding parameters while improving validation loss by 9.47%. Theoretically, I establish formal guarantees through information theory, Rademacher complexity analysis (yielding a 28.09% tighter generalization bound), and category-theoretic interpretation. These results are further supported by Singular Value Decomposition analysis and a blind expert listening study (N=53). Collectively, this work presents a dual contribution that combines architectural innovation with mathematical rigor, offering a principled framework for building more efficient, stable, and interpretable generative models for complex sequential data.
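The NMI figure quoted above measures how much one discrete attribute tells you about another, on a scale from 0 (independent) to 1 (fully determined). The sketch below is a plain-Python illustration that assumes normalisation by the arithmetic mean of the two entropies; the monograph may use a different normalisation, and the toy sequences in the comments are not its data.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def normalized_mutual_information(xs, ys):
    """NMI with arithmetic-mean normalisation (an assumption; other variants exist)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return mutual_information(xs, ys) / ((hx + hy) / 2.0)


if __name__ == "__main__":
    # Independent attributes give 0; identical attributes give 1.
    print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
    print(normalized_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
```

A low NMI between two attributes is the kind of evidence that would justify embedding them separately, since little information is lost by treating them as independent factors.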

2505.04382 2026-06-16 eess.AS cs.LG cs.SD 版本更新

Discrete Optimal Transport and Voice Conversion

离散最优传输与语音转换

Anton Selitskiy, Maitreya Kocharekar

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出kDOT框架,利用预训练语音嵌入空间进行语音转换,通过离散最优传输计划的质心投影改进分布对齐,提升WER、MOS和FAD性能。

Comments 5 pages, 1 figure, 7 tables. 11th International Conference on Machine Learning Technologies (ICMLT), Berlin, Germany, May 2026

详情
AI中文摘要

我们提出kDOT,一种在预训练语音嵌入空间中运行的离散最优传输(OT)框架,用于语音转换(VC)。与kNN-VC和SinkVC中的平均策略以及MKL中的独立假设不同,我们的方法利用离散OT计划的质心投影来构建源和目标说话人嵌入分布之间的传输映射。我们对传输嵌入数量进行了全面的消融研究,并系统分析了源和目标语音持续时间的影响。在LibriSpeech上的实验表明,具有质心投影的OT能够持续改善分布对齐,并且在WER、MOS和FAD方面通常优于基于平均的方法。此外,我们还表明,将离散OT作为后处理步骤可以将伪造语音转换为被最新伪造检测器误判为真实语音的样本。这展示了OT在嵌入空间中的强大域适应能力,同时也揭示了其对伪造检测系统的重要安全影响。

英文摘要

We propose kDOT, a discrete optimal transport (OT) framework for voice conversion (VC) operating in a pretrained speech embedding space. In contrast to the averaging strategies used in kNN-VC and SinkVC, and the independence assumption adopted in MKL, our method employs the barycentric projection of the discrete OT plan to construct a transport map between source and target speaker embedding distributions. We conduct a comprehensive ablation study over the number of transported embeddings and systematically analyze the impact of source and target utterance duration. Experiments on LibriSpeech demonstrate that OT with barycentric projection consistently improves distribution alignment and often outperforms averaging-based approaches in terms of WER, MOS, and FAD. Furthermore, we show that applying discrete OT as a post-processing step can transform spoofed speech into samples that are misclassified as bona fide by a state-of-the-art spoofing detector. This demonstrates the strong domain adaptation capability of OT in embedding space, while also revealing important security implications for spoof detection systems.
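The barycentric projection of a discrete OT plan maps each source embedding to the plan-weighted average of the target embeddings it sends mass to. The sketch below illustrates that step in plain Python. It is not the kDOT implementation: the function names are hypothetical, the squared Euclidean cost is an assumption, and the brute-force solver covers only the special case of equal-size sets with uniform weights, where an optimal plan is a permutation.

```python
import itertools
import math


def exact_plan_uniform(xs, ys):
    """Exact OT plan for equal-size point sets with uniform weights.

    In this special case an optimal plan is a permutation, so brute force works
    for tiny inputs. The cost is O(n!), so this is for illustration only.
    """
    n = len(xs)
    assert n == len(ys)
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(math.dist(xs[i], ys[j]) ** 2 for i, j in enumerate(perm)),
    )
    plan = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(best):
        plan[i][j] = 1.0 / n
    return plan


def barycentric_projection(plan, ys):
    """Map source point i to sum_j P[i][j] * y_j / sum_j P[i][j].

    Assumes every row of the plan carries positive mass.
    """
    dim = len(ys[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append([sum(p * y[k] for p, y in zip(row, ys)) / mass for k in range(dim)])
    return mapped


if __name__ == "__main__":
    source = [[0.0, 0.0], [1.0, 1.0]]
    target = [[1.1, 0.9], [0.1, -0.1]]
    plan = exact_plan_uniform(source, target)
    print(barycentric_projection(plan, target))
```

With a permutation plan each source point lands exactly on its matched target. With a plan that splits mass across several targets, such as one computed between sets of different sizes, the projection returns their weighted mean instead.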

2408.14892 2026-06-16 cs.CL cs.SD eess.AS 版本更新

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm

韵律线索与语义线索在传达讽刺中的功能权衡

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

AI总结 研究通过分析不同讽刺类型语句的声学特征,发现语义明显时韵律线索较不重要,而语义不明显时韵律线索更关键,揭示了讽刺传达中语义与韵律线索的权衡关系。

Comments accepted at Interspeech 2024

详情
AI中文摘要

本研究探讨了讽刺的声学特征,并厘清了语句被用作讽刺的倾向与提示讽刺的韵律线索之间的相互作用。利用从电视节目中收集的讽刺语句数据集,我们分析了语句和关键短语的韵律特征,这些短语属于三种不同的讽刺类别(嵌入式、命题式和施为式),它们在语义线索的强度上有所不同,并与中性表达进行比较。结果表明,在语义明显显示讽刺意义的短语中,韵律线索不如语义不明显时重要,这表明在短语层面,讽刺的韵律线索和语义线索之间存在权衡。这些发现突显了在语义密集的讽刺表达中对韵律调节的依赖性降低,并揭示了塑造讽刺意图传达的细微互动。

英文摘要

This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.