arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06444 2026-06-05 eess.AS cs.CL cs.SD 版本更新

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

USAD 2.0:面向通用音频理解的表征蒸馏规模化

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

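Affiliations: MIT CSAIL; Amazon

AI summary: Proposes USAD 2.0, a universal audio encoder that uses domain-aware distillation to combine knowledge from self-supervised and supervised foundation models, extends coverage to music, and scales to one billion parameters via depth scaling, reaching strong or state-of-the-art performance in probing and LLM-based evaluations.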

Comments Accepted to Interspeech 2026

详情
AI中文摘要

音频编码器对于现代音频应用至关重要,因为大型语言模型(LLM)越来越依赖单一编码器处理多样输入。虽然自监督学习(SSL)已产生强大的领域特定编码器(如语音或音乐专家),但像USAD和SPEAR这样的多领域方法在覆盖范围和评估方面仍然有限。最近的研究也表明,监督编码器与音频LLM的对齐效果更好。我们提出USAD 2.0,一种融合了SSL和监督基础模型知识的通用编码器。USAD 2.0引入了领域感知蒸馏来解决教师不匹配问题,将覆盖范围扩展到音乐领域,并增加了用于下游任务的第二阶段监督蒸馏。我们进一步通过深度缩放将模型扩展到十亿参数。实验表明,USAD 2.0在探测和基于LLM的评估中取得了强劲或最先进的性能。

英文摘要

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

2606.06357 2026-06-05 cs.SD cs.AI eess.AS 版本更新

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

F3-Tokenizer: 驯服音频自编码器潜在变量以支持理解与生成

Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

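Affiliations: Nanjing University, China; WeNet Open Source Community

AI summary: Addresses the weak structure of continuous audio autoencoder latents and the non-decodability of self-supervised encoders with F3-Tokenizer, which pairs a noise-regularized autoencoder bottleneck with a latent-side representation encoder to give a single audio tokenizer for both understanding and generation.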

Comments Technical report; early work; 9 pages, 2 figures, 5 tables

详情
AI中文摘要

连续音频自编码器能很好地重建波形,但通常产生的潜在变量结构较弱,不利于理解;而自监督音频编码器能捕捉语义,但不可直接解码。这种不匹配使得单个音频分词器难以同时支持理解和生成。我们通过两个组件将连续自编码器潜在变量适应于这一场景:噪声正则化的自编码器瓶颈和潜在侧表示编码器。瓶颈使用通道归一化和随机扰动代替基于KL的变分训练,为重建和自回归生成提供尺度可控的连续潜在变量。表示编码器在冻结的自编码器潜在变量上使用RQ-MTP和冻结LLM监督进行训练。最终的分词器为理解提供高维表示,同时保留归一化的连续潜在变量作为生成目标。

英文摘要

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

2606.06211 2026-06-05 cs.CL cs.SD eess.AS 版本更新

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

基于FiLM的说话人条件化SpeechLLM用于病理语音识别

Fernando López, Santosh Kesiraju, Jordi Luque

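Affiliations: Telefónica Innovación Digital, Spain; Universidad Autónoma de Madrid, Spain; Brno University of Technology, Czech Republic

AI summary: Injects x-vector speaker information into every Transformer layer of a frozen ASR encoder via Feature-wise Linear Modulation (FiLM), adapting to individual pathological speakers without modifying base model weights; the approach is competitive with established adaptation strategies and retains performance on non-conditioned speech.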

Comments Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop

详情
AI中文摘要

自动语音识别(ASR)在标准语音方面取得了显著进展;然而,来自神经系统疾病的病理语音仍然是一个重大挑战。我们研究了通过特征线性调制(FiLM)进行说话人条件化,将x-vector派生信息注入冻结的ASR编码器的每个Transformer层,以在不修改基础模型权重的情况下适应个体病理说话人的内部表示。我们在西班牙语和英语病理语音上,针对ASR任务将其与标准和参数高效微调基线进行基准测试,并辅以后处理。此外,我们评估了自适应模型是否保留了回答语音相关问题的能力。结果表明,说话人条件化的ASR与已建立的适应策略具有竞争力,同时保持了对非条件化语音的性能。

英文摘要

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
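The conditioning mechanism described above can be illustrated with a minimal sketch of FiLM, assuming a per-layer scale and shift predicted from the speaker x-vector by small linear projections. The function names, the `1 + delta` parameterisation of the scale, and the toy dimensions are assumptions for illustration and are not taken from the paper.

```python
def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film_params(xvector, w_gamma, b_gamma, w_beta, b_beta):
    """Predict per-channel scale and shift from a speaker embedding.

    Assumption: the scale is parameterised as 1 + delta, so all-zero
    projection weights leave the frozen encoder's activations unchanged.
    """
    gamma = [1.0 + d for d in linear(xvector, w_gamma, b_gamma)]
    beta = linear(xvector, w_beta, b_beta)
    return gamma, beta


def film(hidden, gamma, beta):
    """Apply gamma * h + beta to every frame of a (frames x channels) list."""
    return [[g * h + b for h, g, b in zip(frame, gamma, beta)] for frame in hidden]
```

In this sketch only the projection weights would be trained, which matches the abstract's statement that the base model weights stay frozen; where exactly the modulation sits inside each Transformer layer is not specified in the abstract.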

2606.06200 2026-06-05 cs.SD eess.AS 版本更新

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

学习情感判别表示用于零样本跨语言语音情感识别

Jinyi Mi, Ding Ma, Tomoki Toda

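Affiliations: Graduate School of Informatics, Nagoya University, Japan; Information Technology Center, Nagoya University, Japan

AI summary: Targets zero-shot cross-lingual speech emotion recognition, where distributions differ across languages and the target language has no emotion annotations, with an emotion-discriminative representation learning method that combines supervised contrastive learning and speaker adversarial learning and significantly improves recognition performance.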

Comments Accepted to Interspeech 2026

详情
AI中文摘要

零样本跨语言语音情感识别(SER)由于语言间的分布不匹配以及目标语言缺乏情感标注而仍然具有挑战性。在这种情况下,仅使用源语言数据训练的模型在评估未见过的目标语言时,常常会出现泛化能力下降的问题。为了解决这一局限性,我们提出了一种情感判别表示学习方法,该方法集成了监督对比学习和说话人对抗学习。对比学习促进了跨语言情感对齐,而说话人对抗学习则抑制了与说话人相关的线索,以鼓励说话人不变的表示。在零样本跨语言SER设置下的实验结果表明,与传统训练策略相比,所提出的方法显著提高了SER性能。

英文摘要

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.

2606.05911 2026-06-05 cs.SD cs.LG eess.AS 版本更新

DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement

DBHN-Net: 低复杂度单声道语音增强的双分支混合神经网络

Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou, Zhao Lv, Xuelong Li

Affiliations: State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology (School of Computer Science and Technology), Anhui University; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; Institute of Acoustics, University of Chinese Academy of Sciences; Institute of Artificial Intelligence (TeleAI), China Telecom, China

AI summary: Proposes a dual-branch hybrid network combining an ANN and an SNN; BandSplit and TF-Mamba modules reduce computational complexity, while interaction and fusion modules preserve performance, giving an average 7.5-fold complexity reduction across three public datasets.

Comments This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI)

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI2026)
AI中文摘要

尽管基于人工神经网络(ANN)的语音增强(SE)方法表现出色,但高计算复杂度和高能耗阻碍了它们在实际前端处理任务中的部署。目前,脉冲神经网络(SNN)在降低功耗方面显示出潜力。然而,SNN的离散二进制激活和复杂的时空动态常常导致信息丢失。因此,当前的挑战集中在如何保持性能并降低计算复杂度。为了解决这个问题,本文提出了一种双分支混合神经网络(DBHN)。1)在网络架构方面:设计了一个集成ANN和SNN的双分支网络,其中SNN分支降低功耗,而ANN分支解决信息丢失;开发了BandSplit和时频(TF)-Mamba模块,以同时压缩能耗和增强模型性能;实现了带有残差连接的脉冲特征提取组(SFEG)和信息转换块(ITB)组件,以减轻信息丢失,同时进一步细化特征表示。2)为了促进分支间的信息融合:设计了一个交互模块,以促进双分支网络各个阶段的信息交换;设计了一个TF交叉注意力融合模块,在数据自适应地引导SNN分支保留更多关键信息的同时,对双分支信息进行时频域融合。结果表明,所提出的模型在三个公共数据集上保持了优越的性能,同时与基线模型相比,计算复杂度平均降低了7.5倍。

英文摘要

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and energy consumption hinder deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have shown potential for reducing power consumption. However, the discrete binary activations and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating an ANN and an SNN is designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; BandSplit and Time-Frequency (TF)-Mamba modules are developed to simultaneously reduce energy consumption and enhance model performance; and Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components are implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module is designed to promote information exchange at various stages of the dual-branch network, and a TF-Cross Attention-Fusion module is designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.

2606.05909 2026-06-05 cs.SD eess.AS 版本更新

Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes

超越WER:面向环境临床记录员的配对声学压力测试

Xiao-Hang Jiang, Han-Jie Guo, Ying-Si Liang, Yang Ai, Zhen-Hua Ling, Lei Jiang, Zhi-Yang He

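Affiliations: University of Science and Technology of China; iFLYTEK Co., Ltd.

AI summary: Proposes a paired acoustic stress test that injects noise while keeping the downstream model frozen to expose the safety impact of noise on clinical reasoning; finds that minor acoustic perturbations can invert clinical meaning without substantially raising the word error rate, and demonstrates a lightweight mitigation strategy.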

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

环境临床记录员越来越多地将自动语音识别与大型语言模型结合以自动化文档记录。然而,词错误率等传统指标掩盖了系统性的安全性退化。我们提出了一种配对声学压力测试,以隔离噪声对临床推理的因果影响。对于相同的对话,我们在保持下游模型配置不变的情况下注入多种噪声类型。关键的是,我们发现信号保真度与临床安全性之间存在危险的脱节。平稳环境噪声使词错误率仅增加了微不足道的0.71个百分点,但几乎使不安全输出的比例翻倍。我们的分析表明,轻微的声学扰动可以在不显著增加错误率的情况下逆转临床含义。此外,我们展示了一种轻量级缓解策略,该策略在噪声条件下减轻安全性退化,而无需进行模型微调。

英文摘要

Ambient clinical scribes increasingly combine Automatic Speech Recognition with Large Language Models to automate documentation. However, traditional metrics like Word Error Rate mask systemic safety degradation. We present a paired acoustic stress test to isolate the causal impact of noise on clinical reasoning. For the same dialogues, we inject diverse noise types while keeping the downstream model configuration frozen. Crucially, we uncover a dangerous disconnect between signal fidelity and clinical safety. Stationary ambient noise increased the Word Error Rate by a negligible 0.71 percentage points yet nearly doubled the rate of unsafe outputs. Our analysis reveals that minor acoustic perturbations can invert clinical meaning without substantially inflating error rates. Furthermore, we demonstrate a lightweight mitigation strategy that reduces safety degradation under noisy conditions without requiring model fine-tuning.
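To make the gap between Word Error Rate and meaning concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name and the example sentences in the usage note are hypothetical and do not come from the paper's data.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For the invented pair "the patient has no known allergy to penicillin" and "the patient has known allergy to penicillin", a single deleted word gives a WER of 0.125 while the clinical statement is reversed, which is the kind of disconnect the abstract describes.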

2606.05889 2026-06-05 cs.SD cs.CL eess.AS 版本更新

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

GLASS: 基于GRPO训练的LoRA用于零样本文本转语音中的声学风格引导

Jaehoon Kang, Yejin Lee, Kyuhong Shim

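Affiliations: Department of Artificial Intelligence, Sungkyunkwan University

AI summary: Proposes GLASS, a framework that trains lightweight LoRA adapters with GRPO to provide composable acoustic style control in zero-shot autoregressive TTS, learning each control from rewards rather than style labels.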

详情
AI中文摘要

我们提出GLASS,一个用于零样本自回归文本转语音(TTS)中可组合声学风格控制的框架,该框架从生成后奖励而非风格标签中学习控制。在零样本TTS中,说话人提示通常将说话人身份与语速、音高等韵律属性纠缠在一起,使得在不改变提示本身的情况下难以改变风格。GLASS将每个声学属性视为一个由奖励定义的控制方向。对于每个控制轴,GLASS冻结TTS主干,并使用组相对策略优化(GRPO)训练一个轻量级LoRA适配器,以语音令牌长度和平均F0作为风格奖励,以WER作为可懂度锚点。由于每个控制表示为LoRA权重更新,独立训练的适配器可以通过线性LoRA算术进行交换、插值和组合,而无需重新训练主干。在语速和音高控制上的实验显示了目标风格偏移,同时保持了自然度、说话人相似性和可懂度,并展示了跨独立训练适配器的平滑插值和多轴组合。

英文摘要

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
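The "linear LoRA arithmetic" mentioned above can be sketched as follows, assuming each adapter stores a low-rank pair (B, A) whose product is the weight update for one frozen matrix. The helper names, the scalar mixing weights, and the toy shapes are assumptions for illustration; the paper's actual merging procedure may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose(base, adapters, weights):
    """Return base + sum_k weights[k] * (B_k @ A_k) without mutating `base`.

    adapters: list of (B, A) pairs, e.g. one pair per control axis.
    weights:  one scalar per adapter; 0 disables it, fractional values
              interpolate, and several non-zero values combine axes.
    """
    merged = [row[:] for row in base]
    for (b_mat, a_mat), w in zip(adapters, weights):
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged
```

Because the backbone matrix is only read, the same frozen weights can be recombined with different adapter weights at inference time, which is the property the abstract attributes to representing each control as a LoRA update.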

2606.05852 2026-06-05 cs.SD cs.AI eess.AS 版本更新

UniVoice: A Unified Model for Speech and Singing Voice Generation

UniVoice: 一种用于语音和歌声生成的统一模型

Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

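Affiliations: Giant Network; Shanghai Conservatory of Music

AI summary: Proposes UniVoice, a unified speech and singing voice generation framework based on conditional flow matching; it factorizes the condition into content, melody, and timbre and introduces a null melody token so that one model generates both natural speech and controllable singing.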

Comments 9 pages, 2 figures

详情
AI中文摘要

文本到语音(TTS)和歌声合成(SVS)都旨在从符号输入生成人类声音音频,但它们对生成过程提出了不同的要求。语音生成依赖于灵活的、语言驱动的韵律,而歌声生成则需要明确的旋律控制和准确的节奏对齐。这种不匹配使得训练一个既能生成自然语音又能生成可控歌声的单一模型具有挑战性,因为与旋律相关的条件应该强烈约束歌声,但不应限制语音韵律。我们提出了UniVoice,一种基于条件流匹配的统一语音和歌声生成框架。UniVoice没有使用单一的未分化条件表示,而是将条件分解为内容、旋律和音色,这些由适合模态的编码器编码,并由共享的扩散变换器(DiT)主干网络使用。对于歌声,旋律条件由MIDI音符序列表示;对于语音,它被替换为学习的空旋律标记,使模型能够从语言和声学上下文中推断韵律。这种设计保留了歌声的显式旋律控制,同时避免了对语音施加旋律约束的需要。我们进一步将空旋律标记分析为条件流中旋律边缘化的近似。在3万小时语音和3.5万小时歌声数据上训练,UniVoice在语音上实现了5.26%的音素错误率(PER),与专用TTS系统如F5-TTS(5.21%)和CosyVoice3(5.30%)相当。在歌声生成上,UniVoice实现了16.22%的PER,优于统一基线Vevo1.5(24.72%)。

英文摘要

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).

2606.05754 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Sagnac辅助增强型OTDR分布式声学传感:标准化基准与工程评估框架

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

Affiliations: East China Jiaotong University; School of Materials and Energy, Guangdong University of Technology; Jiangxi Tonghui Technology Group Co., Ltd.; School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology

AI summary: Proposes a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework that target polarization-induced fading and environmental interference; on a 10-km sensing fiber, a dual-branch fusion model reaches 89.79% accuracy with a 5.00% nuisance alarm rate.

详情
AI中文摘要

相位敏感光时域反射计(ϕ-OTDR)因其在大距离上提供分布式时空监测能力,被广泛应用于大规模分布式声学传感(DAS)。然而,其现场性能仍可能因偏振诱导衰落(PIF)、局部信号退化和强环境干扰而恶化。本研究开发了一种Sagnac辅助增强型ϕ-OTDR传感架构和面向工程的DAS事件识别标准化基准框架。Sagnac干涉仪提供连续相位响应,补充了ϕ-OTDR通道中易衰落的观测值,并通过在FPGA平台上实现的互相关过程实现异构信号对齐。该基准协议在一致的数据划分、预处理和度量定义下,比较了传统特征工程方法、概率浅层分类器、单分支深度模型和双分支融合模型。在10公里传感光纤上进行的六类代表性声学事件实验表明,双分支融合模型在评估方法中提供了最有利的权衡,在平衡测试集上达到89.79%的准确率、89.83%的宏F1值和5.00%的虚警率。结果还表明,通道分组对双分支评估影响显著,表明面向部署的结论应基于准确率、宏F1、虚警率、漏报率和延迟,而非仅凭准确率。这项工作为基于ϕ-OTDR的DAS提供了一种物理驱动的增强策略,并为未来面向融合的传感研究提供了可复现的基准协议。用于复现DAS事件识别实验的实现和脚本可在https://github.com/wawa-abc/das公开获取。

英文摘要

Phase-sensitive optical time-domain reflectometry (ϕ-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the ϕ-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.
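The cross-correlation alignment step can be illustrated in software with a minimal sketch that searches for the integer lag maximising the cross-correlation between two sampled channels. This is only an illustration of the general technique; the function name, the brute-force search, and the toy signals are assumptions and do not reflect the authors' FPGA implementation.

```python
def best_lag(reference, signal, max_lag):
    """Return the lag (in samples) at which `signal` best matches `reference`.

    A positive result means `signal` is delayed relative to `reference`.
    """
    def score(lag):
        # Unnormalised cross-correlation over the overlapping samples only.
        return sum(reference[i] * signal[i + lag]
                   for i in range(len(reference))
                   if 0 <= i + lag < len(signal))

    return max(range(-max_lag, max_lag + 1), key=score)
```

Once the lag is known, one channel can be shifted by that many samples before the two streams are fed to a fusion model; real signals would usually be mean-removed and normalised before the search.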

2606.05713 2026-06-05 cs.MM cs.SD eess.AS 版本更新

Beyond Generative Decoding: Discriminative Hidden-State Readout from a Native Omni-Modal LLM for Multimodal Sentiment Analysis

超越生成式解码:来自原生全模态大语言模型的判别性隐藏状态读出用于多模态情感分析

Bin Wen, Tien-Ping Tan

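Affiliations: School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia

AI summary: Generative readout in multimodal sentiment analysis ties continuous regression to discrete autoregressive decoding, costing accuracy and efficiency. The paper instead proposes a discriminative readout built on the Thinker module of the native omni-modal LLM Qwen2.5-Omni-7B, mapping the final-layer hidden state directly to a score through a lightweight regression head, and reaches state-of-the-art results while training on a single consumer GPU.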

Comments 18 pages, 4 figures, 6 tables

详情
AI中文摘要

多模态情感分析(MSA)从语言、声学和视觉信号推断人类情感。最近的方法越来越多地通过生成式读出适应大型多模态模型(LMM):提示模型将情感分数作为文本字符串输出。虽然方便,但这将连续回归与离散自回归解码绑定,带来了未测量的成本。我们重新审视这种读出机制,并提出一种基于原生全模态大语言模型(Qwen2.5-Omni-7B)的Thinker模块构建的判别性公式。我们不是进行文本解码,而是通过轻量回归头在单次前向传播中将最后一个非填充标记的最终层隐藏状态映射到连续分数。使用4位量化和低秩适应(QLoRA),整个7B管道——包括视频和音频处理——在单个消费级GPU(RTX 5090,32 GB)上训练,峰值内存10-21 GB,可训练参数仅1.14%。通过固定骨干网络、数据和LoRA配置的受控比较,我们隔离了读出的影响。在CMU-MOSI和CMU-MOSEI上,我们的判别性读出无需任务特定特征工程即可达到最先进的准确率(MOSI:MAE 0.551,Corr 0.888;MOSEI:MAE 0.506,Corr 0.790),并表现出强大的多种子稳定性。相比之下,生成式读出——即使经过等效的监督训练——平均绝对误差增加了一倍以上,产生无法解析或超出范围的输出(零样本下2.8%),并且延迟更高。模态消融实验揭示了CMU-MOSI上的文本主导模式。我们的发现表明,LMM的读出方式与其训练方式同样重要,证明判别性读出为连续MSA提供了更准确、高效和可靠的替代方案。

英文摘要

Multimodal sentiment analysis (MSA) infers human affect from language, acoustic, and visual signals. Recent methods increasingly adapt large multimodal models (LMMs) via generative readout: prompting the model to emit a sentiment score as a text string. While convenient, this ties continuous regression to discrete autoregressive decoding, incurring unmeasured costs. We revisit this readout mechanism and propose a discriminative formulation built on the Thinker module of a native omni-modal LLM (Qwen2.5-Omni-7B). Instead of text decoding, we map the final-layer hidden state of the last non-padding token to a continuous score via a lightweight regression head in a single forward pass. Using 4-bit quantization and low-rank adaptation (QLoRA), the entire 7B pipeline -- including video and audio processing -- trains on a single consumer GPU (RTX 5090, 32 GB) with 10-21 GB peak memory and 1.14% trainable parameters. Through a controlled comparison fixing the backbone, data, and LoRA configuration, we isolate the impact of the readout. On CMU-MOSI and CMU-MOSEI, our discriminative readout reaches state-of-the-art accuracy without task-specific feature engineering (MOSI: MAE 0.551, Corr 0.888; MOSEI: MAE 0.506, Corr 0.790) and exhibits strong multi-seed stability. In contrast, the generative readout -- even after equivalent supervised training -- more than doubles the mean absolute error, yields unparsable or out-of-range outputs (2.8% zero-shot), and suffers from higher latency. Modality ablations reveal a text-dominant regime on CMU-MOSI. Our findings indicate that how an LMM is read out is as consequential as how it is trained, demonstrating that a discriminative readout offers a more accurate, efficient, and reliable alternative for continuous MSA.
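The discriminative readout described above can be sketched as selecting the hidden state of the last non-padding token and passing it through a linear regression head. The function name, the single-sequence interface, and the purely linear head are assumptions for illustration; the paper's head may include additional layers.

```python
def last_token_readout(hidden_states, attention_mask, weight, bias):
    """Map the last non-padding token's final-layer state to a scalar score.

    hidden_states:  list of per-token feature vectors for one sequence.
    attention_mask: 1 for real tokens, 0 for padding (either padding side).
    weight, bias:   parameters of a hypothetical linear regression head.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    state = hidden_states[last]
    return sum(w * x for w, x in zip(weight, state)) + bias
```

The score comes out of one forward pass with no token-by-token decoding, so there is no text to parse and no possibility of an unparsable output, which is the contrast with generative readout that the abstract draws.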

2606.05678 2026-06-05 cs.SD cs.AI cs.CR 版本更新

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

超越波形鲁棒性:针对自动语音识别的鲁棒特征-声码器对抗攻击

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun, Xinhu Zheng, Xinlei He

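Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Wuhan University

AI summary: Proposes a surrogate-based black-box adversarial attack that perturbs self-supervised acoustic-phonetic representations and reconstructs them through a vocoder rather than adding noise to the waveform, improving transferability and the ability to bypass defenses.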

Comments 11 pages

详情
AI中文摘要

自动语音识别(ASR)系统已广泛用于多语言语音到文本转录。其对对抗攻击的鲁棒性已成为社区的重要课题。现有对抗攻击直接将对抗噪声添加到语音音频中。然而,先前工作表明,现有对抗攻击面临两个限制:它们通常难以迁移到黑盒ASR系统,并且越来越多地被针对输入空间扰动的防御所缓解。在这项工作中,我们提出了一种清洁参考特征-声码器攻击,这是一种基于替代模型的黑盒攻击,将对抗搜索空间从原始波形转移到自监督学习(SSL)表示。为了解决可迁移性限制,我们扰动更具泛化性的声学-语音表示,而不是低层波形样本,减少对替代模型特定波形梯度的依赖,并鼓励对抗扰动跨ASR系统泛化。为了绕过不同的防御,我们将对抗信号从显式的加性波形噪声转移到SSL特征空间扰动,并通过声码器将其重构为类似语音的波形对抗信号,使生成的样本与基于波形的防御不太一致。大量实验表明,当仅在原始Whisper-small作为公开替代模型上优化时,我们的攻击有效迁移到黑盒ASR模型,WER比SOTA基线提高+26.6,同时针对多种训练防御仍保持有效,WER提高+36.2。这些结果揭示了当前ASR鲁棒性评估中的一个盲点。

英文摘要

Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.

2606.05575 2026-06-05 cs.SD eess.AS 版本更新

SB-RF: Schrödinger Bridge Rectified Flow for One-Step Robust Speech Enhancement

SB-RF: 用于一步鲁棒语音增强的薛定谔桥整流流

Caixia Lu, Xueyang Lv, Penglong Hu, Jiaming Xu

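Affiliations: Xiaomi Corporation, Beijing, China

AI summary: Proposes SB-RF, a one-step generative speech enhancement framework that combines Rectified Flow with Schrödinger Bridge theory, building a conditional bridge via entropy-regularized optimal transport; it achieves leading performance among generative methods on VoiceBank-DEMAND and shows strong robustness and efficiency at low signal-to-noise ratios.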

详情
AI中文摘要

生成模型在语音增强中表现出令人印象深刻的结果,但通常受限于多步推理。我们提出SB-RF,一种将整流流(RF)与薛定谔桥(SB)理论相结合的一步生成框架。SB-RF通过熵正则化最优传输在干净和带噪语音分布之间构建条件桥。通过RF的速度匹配目标将SB轨迹与最优传输测地线对齐,SB-RF能够通过一步生成实现高质量增强。实验表明,SB-RF在VoiceBank-DEMAND基准上达到了生成方法中的领先性能。此外,为了全面评估在具有挑战性的真实场景中的性能,我们在一个模拟的低信噪比测试集上使用扩大的训练数据集评估SB-RF。在这些条件下,SB-RF展现出强大且具有竞争力的鲁棒性和高效率,验证了其在现实应用中的潜力。

英文摘要

Generative models have shown impressive results in speech enhancement but often suffer from multi-step inference. We propose SB-RF, a one-step generative framework integrating Rectified Flow (RF) with Schrödinger Bridge (SB) theory. SB-RF constructs a conditional bridge between clean and noisy speech distributions via entropy-regularized optimal transport. By aligning SB trajectories with the optimal transport geodesic through the velocity-matching objective of RF, SB-RF enables high-quality enhancement with one-step generation. Experiments demonstrate that SB-RF achieves leading performance among generative methods on the VoiceBank-DEMAND benchmark. Furthermore, to fully assess performance in challenging real-world scenarios, we evaluate SB-RF on a simulated low signal-to-noise ratio test set using an expanded training dataset. Under these conditions, SB-RF exhibits strong and competitive robustness with high efficiency, validating its potential for real-world applications.

2606.05571 2026-06-05 cs.SD eess.AS 版本更新

Sound Effects Dataset Unification With the Universal Category System

使用通用分类系统统一音效数据集

Jun Woo Beck, Alexander Lerch

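Affiliations: University of California, Berkeley

AI summary: Proposes a modular dataset relabeling framework based on the Universal Category System (UCS) that reaches high automatic conversion rates through a rule-based multi-stage pipeline with conflict resolution, and introduces EnvSound-UCS, a unified dataset of 58,057 sound clips.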

Comments DAFx 2026 camera-ready version

详情
AI中文摘要

音效(SFX)数据集和库通常采用不同的标注方案、分类法和元数据结构。这给SFX分类和生成的研究带来了挑战,因为不兼容的分类法导致数据集孤立,可能需要个性化方法,产生不可比较的结果,并阻碍数据合并策略。我们提出了一个模块化的数据集重新标注框架,采用通用分类系统(UCS)——一种行业标准的音效层次分类法——作为共享结构基础。这个开源框架使我们能够(i)通过基于规则的多阶段流水线和冲突解决,将现有数据集的标签转换为UCS,实现高自动转换率;(ii)为新标签建议分层数据集划分;(iii)合并多个数据集。为了展示实际效用,我们引入了EnvSound-UCS数据集,这是一个公开可用的、符合UCS的统一环境声音数据集,包含来自AudioSet、FSD50K和ESC-50三个来源的58,057个音频片段。

英文摘要

Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation because incompatible taxonomies lead to siloed datasets that might require individualized approaches, result in non-comparable outcomes, and prevent data merging strategies. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert tags of existing datasets to UCS with a rule-based multi-stage pipeline and conflict resolution to achieve high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase the practical utility, we introduce the EnvSound-UCS dataset, a publicly available unified UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.

2606.05569 2026-06-05 cs.CL cs.SD eess.AS 版本更新

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

基于语言特定统计图的领域感知发音错误检测与诊断

Huu Tuong Tu, Hanh Nguyen, Thien Van Luong, Nguyen Tien Cuong, Vu Huan, Nguyen Thi Thu Trang

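Affiliations: Hanoi University of Science and Technology; VNPT AI, VNPT Group; National Economics University

AI summary: Proposes learning phoneme confusion patterns from language-specific statistical graphs, achieving an F1-score of 59.52% on the L2-ARCTIC benchmark and outperforming several baselines.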

Comments Accepted at Interspeech 2026

详情
AI中文摘要

近年来,发音错误检测与诊断(MDD)在计算机辅助语言学习和语音技术中变得越来越重要。本文提出了一种构建统计图的方法,使模型能够学习表示为有向图的音素混淆模式。此外,我们引入了一种语言特定策略,以捕捉不同母语(L1)背景下的系统性发音差异。通过在L2-ARCTIC基准上的大量实验证明了我们方法的有效性,该方法达到了59.52%的F1分数,优于多个竞争基线。

英文摘要

Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.
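One simple way to picture a language-specific statistical graph is a directed graph per L1 background whose edge from a canonical phoneme to a realized phoneme is weighted by the empirical confusion probability. The sketch below assumes aligned (L1, canonical, realized) triples as input; the function names, the probability weighting, and the toy data are hypothetical, and the paper's actual graph construction is not specified in the abstract.

```python
from collections import Counter, defaultdict


def confusion_graph(pairs):
    """Build {canonical: {realized: probability}} from aligned phoneme pairs."""
    counts = defaultdict(Counter)
    for canonical, realized in pairs:
        counts[canonical][realized] += 1
    return {
        canonical: {r: n / sum(out.values()) for r, n in out.items()}
        for canonical, out in counts.items()
    }


def graphs_by_l1(samples):
    """Group (l1, canonical, realized) triples and build one graph per L1."""
    grouped = defaultdict(list)
    for l1, canonical, realized in samples:
        grouped[l1].append((canonical, realized))
    return {l1: confusion_graph(pairs) for l1, pairs in grouped.items()}
```

A model could consume such a graph as an adjacency matrix or as edge features; keeping one graph per L1 is what lets systematic, background-dependent substitutions show up as distinct edge weights.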

2606.05544 2026-06-05 cs.SD eess.AS 版本更新

Probing Spatial Structure in Pretrained Audio Representations

探究预训练音频表示中的空间结构

Chuyang Chen, Sivan Ding, Adrian S. Roman, Juan Pablo Bello

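Affiliations: Music and Audio Research Laboratory, New York University, USA

AI summary: Introduces the SARL benchmark to systematically evaluate how well pretrained audio models encode spatial information, finding that source factors are easier to decode than room factors and that encoders respond heterogeneously to spatial variation.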

Comments Accepted to Interspeech 2026

详情
AI中文摘要

预训练空间音频编码器越来越多地被用作感知任务的通用表示,但其空间编码能力仍知之甚少。我们引入了空间音频表示学习(SARL)基准,这是一个用于评估预训练音频模型中空间信息的受控框架。SARL探测源级因素(方位角、仰角、距离、类别)和房间级因素(RT60、体积、形状)。跨多种编码器的实验揭示了三种模式:输入配置和训练范式塑造空间编码;源因素始终比房间因素更容易解码;在受控扰动下的敏感性分析显示了对源和房间变化的异质性响应。这些结果揭示了当前预训练音频表示中的系统性偏差。SARL作为开源基准发布,用于可重复评估空间音频表示。

英文摘要

Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.

2606.05522 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Exploring LLMs for South Asian Music Understanding and Generation

探索大语言模型对南亚音乐的理解与生成

Faria Binte Kader, Mohtasim Hadi Rafi, Shah Wasif Sajjad, Santu Karmaker

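Affiliations: University of Central Florida; Auburn University

AI summary: Systematically evaluates LLMs on understanding and generation tasks for raga- and tala-based South Asian classical music, finding that frontier models reach 85-90% accuracy on understanding tasks while even the strongest model produces stylistically faithful generations only 40% of the time.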

Comments 19 pages, 7 figures

详情
AI中文摘要

近年来,大语言模型(LLMs)在音乐理解和生成任务中展现出令人瞩目的成果。然而,现有研究仍局限于西方调性传统,未能揭示当前LLMs能否处理结构独特的低资源音乐传统。我们首次系统评估LLMs在南亚古典音乐中的能力——这种传统由拉格(raga)和塔拉(tala)的旋律约束主导,其结构原则与西方和声驱动音乐根本不同。我们的评估基于印度斯坦古典理论和孟加拉古典形式,包括拉宾德拉(Rabindra)和纳兹鲁尔(Nazrul)歌曲——南亚古典音乐中具有代表性的低资源传统。在音乐理解评估中,我们引入了一个包含504个问答的基准测试,涵盖拉格语法、文化知识和符号记谱推理,评估了33个LLMs,其中前沿模型如Gemini 2.5 Pro达到85-90%的准确率,而大多数开源模型仅在23-40%范围内。在音乐生成方面,我们设计了一个五级受控提示框架,发现即使最强的模型也只有40%的时间能产生风格忠实的输出。这些结果表明,音乐生成中的结构有效性和风格忠实度是不同的目标,并突显了文化基础音乐建模的一个开放挑战。

英文摘要

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

2606.05367 2026-06-05 cs.SD eess.AS 版本更新

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

基于任务向量算术的语言模型文本到语音情感表达控制

Daniel Oliveira de Brito, Arnaldo Candido Junior

发表机构 * Instituto de Biociências, Letras e Ciências Exatas, Universidade Estadual Paulista "Júlio de Mesquita Filho" (UNESP)(圣保罗州立大学 "Júlio de Mesquita Filho" (UNESP) 生物科学、文学与精确科学学院)

AI总结 本文通过系统性逐项排除实验,将情感韵律的主要载体定位为x-vector,并提出一种基于x-vector质心算术的无训练方法,实现跨说话人情感强度控制,在基本保留身份和可懂度的同时提升情感相似度。

Comments 10 pages, 5 figures

详情
AI中文摘要

我们研究了任务向量算术(在模块化文本到语音(TTS)中成功用于跨说话人情感强度控制)是否能够迁移到基于语言模型骨干和上下文学习(LM-TTS)构建的大规模TTS系统。我们在Qwen3-TTS-12Hz-1.7B上对四个范围逐渐缩小的操作数进行了系统性逐项排除研究,即通过LoRA微调的模型权重、连续编解码器嵌入、离散编解码器标记,以及由与合成骨干联合训练的ECAPA-TDNN编码器生成的说话人嵌入(x-vector),并将情感韵律的主要载体定位到x-vector。基于这一发现,我们提出了一种基于x-vector空间质心算术的无训练方法:情感方向τ = E_i[x(s_i, emo)] - E_i[x(s_i, neutral)],应用于未见过的目标说话人:x_new = x(target, neutral) + α·τ。使用ESD(英语)作为τ源,emoUERJ(巴西葡萄牙语)作为跨语言真实目标,我们观察到在英语保留说话人上,emotion2vec余弦相似度比ICL基线平均提升+0.29,在巴西葡萄牙语保留说话人上提升+0.09,同时很大程度上保留了身份(多说话人τ变体的WavLM SECS ≳ 0.88)和可懂度(PT-BR中WER ≈ 0)。这些结果初步证明,当算术操作作用于说话人嵌入时,可以规避先前报道的基于质心算术的风格控制与基于标记的TTS架构不兼容的问题。

英文摘要

We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B - model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone - we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$ applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$. Using ESD (English) as the $\tau$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
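
The centroid arithmetic stated in the abstract can be illustrated with a minimal pure-Python sketch. It only mirrors the two formulas for τ and x_new; the function names and the toy 3-dimensional vectors are assumptions made for illustration, and real x-vectors would come from the speaker encoder and have far more dimensions.

```python
from statistics import fmean

Vector = list[float]


def centroid(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [fmean(dim) for dim in zip(*vectors)]


def emotion_direction(emotional: list[Vector], neutral: list[Vector]) -> Vector:
    """tau = mean of emotional embeddings minus mean of neutral embeddings."""
    emo_c, neu_c = centroid(emotional), centroid(neutral)
    return [e - n for e, n in zip(emo_c, neu_c)]


def shift_embedding(target_neutral: Vector, tau: Vector, alpha: float) -> Vector:
    """x_new = x(target, neutral) + alpha * tau."""
    return [x + alpha * t for x, t in zip(target_neutral, tau)]


# Toy embeddings with made-up values (hypothetical, not from the paper).
emotional = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
neutral = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

tau = emotion_direction(emotional, neutral)
x_new = shift_embedding([0.5, 0.5, 0.5], tau, alpha=0.5)
```

Here `alpha` plays the role of the intensity scale α: a larger value moves the target speaker's neutral embedding further along the emotion direction.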

2605.26236 2026-06-05 cs.CV cs.SD 版本更新

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

DuoGesture: 神经启发与生物力学约束的双流共语手势生成

Ferdinand Paar, Lanmiao Liu, Aslı Özyürek, Serge Thill, Esam Ghaleb

发表机构 * Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Radboud University(拉德堡德大学) Utrecht University(乌得勒支大学)

AI总结 提出DuoGesture,一种神经启发和生物力学约束的双流方法,通过语义变分信息瓶颈协调语义流和节拍流,实现语义表达与生物力学合理的节律运动。

详情
AI中文摘要

共语手势生成既需要语义表达性,也需要生物力学上合理的节律运动。现有的整体手势模型混合了基于词汇的语义手势和频繁的韵律对齐节拍手势,这限制了语义基础、语音-运动对齐和运动平滑性。我们提出DuoGesture,一种神经启发和生物力学约束的双流方法,将共语手势合成分解为耦合的语义流和节拍流。两个流通过语义变分信息瓶颈协调,这是一个随机帧级门控,学习何时语义手势应覆盖节律节拍运动。语义流由运动基础语义条件控制,该条件用运动-语言表示替代纯语言词嵌入,为手势的长尾词汇触发提供运动对齐的语义先验。节拍流进一步由惯性节拍先验正则化,这是一个按人体测量学加权的臂链模块,能减少抖动并提高节律一致性,同时不约束语义帧。客观评估和主观实验表明,DuoGesture优于强整体基线,而组件消融证实了语义基础、随机流选择和生物力学正则化的互补作用。

英文摘要

Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by Motion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.

2602.22417 2026-06-05 cs.SD eess.AS 版本更新

Absorbing Discrete Diffusion for Speech Enhancement

吸收式离散扩散用于语音增强

Philippe Gonzalez

发表机构 * Department of Health Technology, Technical University of Denmark(丹麦技术大学健康技术系)

AI总结 本文提出了一种基于吸收式离散扩散的语音增强方法ADDSE,结合神经音频编解码器的潜在空间和扩散模型的非自回归采样过程,以提高低信噪比下的语音增强性能。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

受神经语音编码和基于扩散的语言建模近期进展的启发,我们使用吸收式离散扩散,对给定带噪语音码时干净语音码的条件分布进行建模,以此解决语音增强问题。所提出的方法称为ADDSE,它同时利用了神经音频编解码器富有表达力的潜在空间和扩散模型的非自回归采样过程。为高效建模残差向量量化码的层次结构,我们提出了RQDiT,它结合RQ-Transformer和扩散Transformer的技术以实现非自回归建模。结果表明,该方法在两个数据集上的非侵入式客观指标方面具有竞争力,在低信噪比和采样步数较少时尤为明显。代码和音频示例已在线公开。

英文摘要

Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.
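
As a rough illustration of what "absorbing" means in absorbing discrete diffusion, the sketch below shows a generic forward corruption step in which each discrete code is independently replaced by a special mask state. It is not ADDSE itself: the `MASK` id, the choice of a masking probability equal to `t`, and the flat list of codes are all assumptions, and neither the conditioning on noisy speech codes nor the residual vector quantization hierarchy handled by RQDiT is modeled here.

```python
import random

MASK = -1  # hypothetical id for the absorbing (mask) state


def absorb(codes: list[int], t: float, rng: random.Random) -> list[int]:
    """Forward corruption: replace each code by MASK with probability t.

    Assumes a linear schedule (masking probability equals t), which is an
    illustrative choice and not taken from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else code for code in codes]


rng = random.Random(0)
clean_codes = [17, 4, 250, 99, 3, 128]  # made-up codec indices

untouched = absorb(clean_codes, 0.0, rng)     # t = 0: nothing is masked
half_masked = absorb(clean_codes, 0.5, rng)   # t = 0.5: about half masked on average
fully_masked = absorb(clean_codes, 1.0, rng)  # t = 1: every code is absorbed
```

A denoising model would then be trained to predict the original codes at the masked positions, which is what allows several positions to be filled in per sampling step instead of one token at a time.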

2511.20107 2026-06-05 cs.CL cs.SD eess.AS 版本更新

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

无需模型训练的误读检测与诊断:基于检索的方法

Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang

发表机构 * Hanoi National University of Education(河内教育大学)

AI总结 本文提出一种无需模型训练的误读检测与诊断方法,利用预训练的自动语音识别模型和检索技术,实现高准确率的发音错误检测与诊断,实验表明其在L2-ARCTIC数据集上达到69.60%的F1分数。

详情
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
AI中文摘要

误读检测与诊断(MDD)对于语言学习和语音治疗至关重要。与传统方法需要评分模型或训练音素级模型不同,我们提出了一种新颖的无训练框架,利用预训练的自动语音识别模型和检索技术。我们的方法避免了音素特定建模或额外的任务特定训练,但仍能实现准确的发音错误检测与诊断。在L2-ARCTIC数据集上的实验表明,我们的方法在避免模型训练复杂性的同时,达到了69.60%的F1分数。

英文摘要

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

2510.09061 2026-06-05 cs.SD eess.AS 版本更新

O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

O_O-VC: 面向任意到任意语音转换的合成数据驱动一对一对齐

Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

发表机构 * VNPT AI, VNPT Group(VNPT AI,VNPT集团) Hanoi University of Science and Technology(河内科学技术大学) Business AI Lab, National Economics University(国家经济大学商业人工智能实验室)

AI总结 本文提出了一种基于合成数据驱动的任意到任意语音转换方法,通过利用高质量预训练多说话人文本到语音模型生成的合成语音数据,学习源语音到目标语音的直接映射,从而在保留语言内容的同时捕捉说话人特定特征,并在零样本场景中提升适应性和性能。

Comments EMNLP 2025

详情
Journal ref
Findings of the Association for Computational Linguistics: EMNLP 2025
AI中文摘要

传统语音转换(VC)方法通常试图将说话人身份和语言信息分离为不同的表示,然后将这些表示组合起来重建音频。然而,有效解耦这些因素仍然具有挑战性,往往导致训练过程中的信息丢失。在本文中,我们提出了一种新的方法,利用由高质量预训练多说话人文本到语音(TTS)模型生成的合成语音数据。具体而言,使用共享相同语言内容但说话人身份不同的合成数据对作为输入-输出对来训练语音转换模型。这使模型能够学习源语音和目标语音之间的直接映射,从而有效捕捉说话人特定特征的同时保留语言内容。此外,我们引入了一种灵活的训练策略,用于任意到任意语音转换,该策略在未见过的说话人和新语言上泛化良好,增强了在零样本场景中的适应性和性能。我们的实验表明,所提出的方法在词错误率上实现了16.35%的相对减少,并在说话人余弦相似度上提升了5.91%,优于几种最先进的方法。语音转换样本可访问:https://oovc-emnlp-2025.github.io/

英文摘要

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic speech data generated by a high-quality, pretrained multispeaker text-to-speech (TTS) model. Specifically, synthetic data pairs that share the same linguistic content but differ in speaker identity are used as input-output pairs to train the voice conversion model. This enables the model to learn a direct mapping between source and target voices, effectively capturing speaker-specific characteristics while preserving linguistic content. Additionally, we introduce a flexible training strategy for any-to-any voice conversion that generalizes well to unseen speakers and new languages, enhancing adaptability and performance in zero-shot scenarios. Our experiments show that our proposed method achieves a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity, outperforming several state-of-the-art methods. Voice conversion samples can be accessed at: https://oovc-emnlp-2025.github.io/

2412.06259 2026-06-05 eess.AS cs.SD 版本更新

Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection

利用提示学习和停顿编码进行阿尔茨海默病检测

Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling

发表机构 * National Social Science Foundation of China(中华人民共和国国家社会科学基金) Supercomputing Center of the USTC(中国科学技术大学超算中心)

AI总结 本文提出通过提示学习和停顿信息编码改进基于转录文本的阿尔茨海默病检测,利用提示模板将分类任务转化为掩码语言建模任务,并比较不同自动语音识别模型、应用集成技术,最高达到95.8%的检测准确率。

Comments Accepted by ISCSLP 2024

详情
Journal ref
Proc. IEEE ISCSLP 2024, pp. 486-490, 2024
AI中文摘要

与其他临床筛查技术相比,基于语音和语言的自动化阿尔茨海默病(AD)检测方法具有非侵入性、成本效益高和便捷的特点。先前研究已证明微调预训练语言模型(PLMs)在AD检测中的有效性。然而,这种传统微调方法仅输入转录文本,其目标与PLMs预训练阶段使用的掩码语言建模(MLM)任务不一致。本文研究了基于提示的PLMs微调方法,通过在转录输入中插入提示模板将分类任务转化为MLM任务。我们还探索了将强制对齐得到的停顿信息纳入人工转录文本的影响。此外,我们比较了多种自动语音识别(ASR)模型的性能,并选择Whisper模型生成基于ASR的转录文本,与人工转录文本进行比较。我们还在不同PLMs(BERT和RoBERTa)上使用不同随机种子应用了多数投票和集成技术。最终,使用人工转录文本获得的最高检测准确率为95.8%(均值87.9%,标准差3.3%),在ADReSS测试集上实现了仅使用转录文本进行AD检测的最先进性能。

英文摘要

Compared to other clinical screening techniques, speech-and-language-based automated Alzheimer's disease (AD) detection methods are characterized by their non-invasiveness, cost-effectiveness, and convenience. Previous studies have demonstrated the efficacy of fine-tuning pre-trained language models (PLMs) for AD detection. However, the objective of this traditional fine-tuning method, which involves inputting only transcripts, is inconsistent with the masked language modeling (MLM) task used during the pre-training phase of PLMs. In this paper, we investigate prompt-based fine-tuning of PLMs, converting the classification task into an MLM task by inserting prompt templates into the transcript inputs. We also explore the impact of incorporating pause information from forced alignment into manual transcripts. Additionally, we compare the performance of various automatic speech recognition (ASR) models and select the Whisper model to generate ASR-based transcripts for comparison with manual transcripts. Furthermore, majority voting and ensemble techniques are applied across different PLMs (BERT and RoBERTa) using different random seeds. Ultimately, we obtain a maximum detection accuracy of 95.8% (with mean 87.9%, std 3.3%) using manual transcripts, achieving state-of-the-art performance for AD detection using only transcripts on the ADReSS test set.
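
Two steps named in the abstract, recasting classification as a masked-token prediction and combining runs by majority voting, can be sketched without any model. The template wording, the label words, and the example transcript below are hypothetical placeholders (the abstract does not state the actual ones), and the PLM that would fill in `[MASK]` is left out entirely.

```python
from collections import Counter

# Hypothetical prompt template and label words, for illustration only.
TEMPLATE = "{transcript} The diagnosis is [MASK]."
LABEL_WORDS = {"dementia": "AD", "healthy": "non-AD"}


def build_prompt(transcript: str) -> str:
    """Append the prompt template so a masked LM can predict the label word."""
    return TEMPLATE.format(transcript=transcript.strip())


def label_from_word(predicted_word: str) -> str:
    """Map the word predicted at [MASK] back to a class label."""
    return LABEL_WORDS[predicted_word]


def majority_vote(predictions: list[str]) -> str:
    """Most frequent label across runs; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


prompt = build_prompt("  the boy is on the stool reaching for the cookie jar ")

# Made-up label words standing in for predictions from runs with different seeds.
per_seed_words = ["dementia", "healthy", "dementia"]
final_label = majority_vote([label_from_word(w) for w in per_seed_words])
```

In an actual setup, each entry of `per_seed_words` would be the higher-scoring label word at the `[MASK]` position from one fine-tuned PLM run.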