arXivDaily: daily arXiv academic digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (3 papers)

2606.19381 2026-06-19 cs.SD cs.AI New submission

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng

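Affiliations: College of Computing and Data Science, Nanyang Technological University; Google DeepMind

AI summary: To address the scarcity of high-quality text-speech pairs for code-switching speech recognition, the paper proposes a code-mixing guided preference-learning framework that uses the Code Mixing Index to improve the code-switching fidelity of synthetic speech. Fine-tuning Whisper Large on the SEAME corpus reduces the Mixed Error Rate from 12.1%/17.8% to 8.9%/14.2%.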

Comments Accepted to Interspeech 2026

Abstract

Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
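
The Code Mixing Index named in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited utterance-level definition, CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens in language i. The abstract does not say which CMI variant the paper uses or how it enters the preference objective; the function name and the language tags below are hypothetical.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral_tag="other"):
    """Utterance-level CMI in [0, 100]; 0 means monolingual.

    lang_tags: one language tag per token (hypothetical tag names).
    neutral_tag: marks language-independent tokens (an assumption).
    """
    n = len(lang_tags)
    counts = Counter(t for t in lang_tags if t != neutral_tag)
    u = n - sum(counts.values())
    if n == u:  # no language-specific tokens at all
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

mono = code_mixing_index(["zh", "zh", "zh", "zh"])
mixed = code_mixing_index(["zh", "zh", "en", "en", "other"])
```

Here the monolingual utterance scores 0 and the evenly mixed one scores 50, so a generator could in principle be steered by comparing the CMI of the input text with the CMI of a transcript of the synthesized audio.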

2606.19398 2026-06-19 cs.SD eess.AS eess.SP New submission

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

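Affiliations: Carnegie Mellon University; New York University; James Silberrad Brown Center for AI; Columbia University; Northeastern University; Stanford University; Amazon GenAI

AI summary: Proposes S-JEPA, an encoder-predictor pair trained with KL divergence to match the soft posteriors of a Gaussian Mixture Model, with no offline re-clustering or teacher distillation. Under the SUPERB protocol it achieves the lowest WER among the evaluated methods below 90M parameters and establishes a new Pareto frontier.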

Abstract

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
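
The contrast between soft and hard targets described above can be sketched in a toy setting. The example assumes a one-dimensional, two-component GMM and computes KL(target || prediction); the abstract does not state the direction of the KL term or any model details, so every name and value here is illustrative only.

```python
import math

def gmm_posteriors(x, weights, means, variances):
    """Soft posteriors p(k | x) of a 1-D Gaussian mixture (toy setting)."""
    log_p = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_p)
    exp_p = [math.exp(lp - top) for lp in log_p]  # numerically stable softmax
    total = sum(exp_p)
    return [e / total for e in exp_p]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) in nats."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A frame exactly halfway between two equally weighted clusters.
soft_target = gmm_posteriors(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
hard_target = [1.0, 0.0]  # what a single argmax cluster ID would keep
tie_entropy = entropy(soft_target)
loss_if_overconfident = kl_divergence(soft_target, [0.99, 0.01])
```

The soft target for this boundary frame has entropy ln 2 (about 0.693 nats), the "perfect two-cluster tie" value mentioned in the abstract, while the hard target has zero entropy and therefore carries no trace of the ambiguity.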

2606.19996 2026-06-19 cs.SD cs.CL New submission

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China; Shanghai Key Laboratory of Perception and Control in Industrial Network Systems; Department of Computer Science and Engineering, University of Bologna; Department of Mathematical, Physical and Computer Sciences, University of Parma

AI summary: Proposes a segment-level representation learning framework that combines an autoencoder with contrastive learning, achieving stable binary and three-class cognitive impairment detection on four Mandarin Chinese datasets, with the most notable gains in the clinically challenging three-class setting.

Comments 15 pages, 7 figures, 5 tables

Abstract

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2. Speech Synthesis and Sound Generation (5 papers)

2606.18485 2026-06-19 cs.SD cs.AI eess.AS New submission

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

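Affiliations: NVIDIA Corporation

AI summary: Proposes MagpieTTS-LF, an inference-time method that uses soft attention priors, stateful inference, and history-aware text encoding to generate coherent long-form speech without retraining the model.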

Journal ref Interspeech 2026

Abstract

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

2606.19629 2026-06-19 cs.SD cs.AI cs.LG New submission

RIVET: Robust Idempotent Voice Attribute Editing

Dareen Alharthi, Bhuvan Koduru, Rita Singh, Bhiksha Raj

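Affiliations: Carnegie Mellon University

AI summary: Proposes RIVET, a training framework that uses idempotency regularization to make voice attribute editing models more robust to label noise; it outperforms standard training on datasets with both synthetic and naturally occurring label noise.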

Abstract

Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
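
The idempotency condition f(f(x)) = f(x) can be turned into a penalty term, as in the rough sketch below. Scalar toy functions stand in for an editing model; the abstract does not give the form of RIVET's idempotency objective, so the mean squared gap and all names here are assumptions.

```python
def idempotency_penalty(f, xs):
    """Mean squared gap between f(f(x)) and f(x) over scalar inputs."""
    gaps = [(f(f(x)) - f(x)) ** 2 for x in xs]
    return sum(gaps) / len(gaps)

def clamp(x):
    """Idempotent: clamping twice equals clamping once."""
    return max(-1.0, min(1.0, x))

def halve(x):
    """Not idempotent: each application changes the result again."""
    return 0.5 * x

inputs = [-3.0, -0.5, 0.0, 0.7, 2.0]
penalty_clamp = idempotency_penalty(clamp, inputs)
penalty_halve = idempotency_penalty(halve, inputs)
```

The penalty is zero for the idempotent operator and positive for the other, which is the property a regularizer of this kind would push an editing model toward.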

2606.19792 2026-06-19 cs.SD New submission

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda

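Affiliations: CyberAgent, Japan; Nagoya University, Japan

AI summary: Studies how pre-trained models behave when new phonemes are added during fine-tuning, finding that pre-training mainly improves naturalness and offers limited benefit for adding new phonemes.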

Comments Accepted by INTERSPEECH 2026

Abstract

Transfer learning is widely used for low-resource text-to-speech. When the target corpus contains phonemes unseen in pre-training, the model must expand its phoneme inventory during fine-tuning; we call the process "phoneme addition." However, it remains unclear whether the pre-trained ability to generate seen phonemes contributes to this process. This study investigates phoneme addition in two settings: (1) a simulation setup using LLM-generated phoneme-controlled corpora that enables investigation without considering confounding factors, and (2) a real-speech cross-lingual transfer setup (English to Japanese) to validate whether the findings hold in practice. Experiments in both settings showed that while fine-tuning achieved higher naturalness than training from scratch, it required as much or more data to achieve comparable PER for new phonemes. These results indicate that pre-training mainly contributes to naturalness improvement, but offers limited benefit for phoneme addition.
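
Assuming PER in the abstract denotes phoneme error rate, it is conventionally the edit distance between reference and recognized phoneme sequences divided by the reference length. The sketch below shows that standard computation; the phoneme symbols are made up for illustration and the abstract does not describe the paper's evaluation pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Illustrative sequences: the hypothesis drops one phoneme ("N").
ref = ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]
hyp = ["k", "o", "n", "i", "ch", "i", "w", "a"]
per = phoneme_error_rate(ref, hyp)
```

Restricting the reference to positions that hold newly added phonemes would give a per-phoneme view, which is closer to the "PER for new phonemes" comparison the abstract describes.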

2606.20101 2026-06-19 cs.SD cs.AI cs.MM New submission

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Fisheries College, Ocean University of China; College of Information and Electrical Engineering, China Agricultural University

AI summary: Proposes a hybrid two-stage diffusion transformer that uses a coarse-to-fine strategy to balance global semantic alignment with local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.

Abstract

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
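
The quadratic-cost argument can be made concrete with a small count of attention score entries. This is only a back-of-the-envelope sketch: the token counts are hypothetical, and it counts query-key pairs for a single attention operation rather than modeling the paper's actual blocks.

```python
def joint_attention_pairs(n_audio, n_text):
    """Query-key pairs when audio and text tokens attend jointly."""
    n = n_audio + n_text
    return n * n

def cross_attention_pairs(n_audio, n_text):
    """Query-key pairs when audio queries attend only to text keys."""
    return n_audio * n_text

n_text = 64  # hypothetical instruction length in tokens
low_res = joint_attention_pairs(256, n_text)          # coarse stage
high_res_joint = joint_attention_pairs(1024, n_text)  # fine stage, joint
high_res_cross = cross_attention_pairs(1024, n_text)  # fine stage, cross
```

With these assumed sizes, joint attention at the high-resolution stage needs 1,183,744 pairs against 102,400 at low resolution, while cross-attention at high resolution needs 65,536, which suggests why limiting joint attention at full resolution can save compute.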

2606.20218 2026-06-19 cs.SD New submission

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

Yudong Li, Zihao Fang, Junwen Qiu, Ruihai Jing, Ruixiang Hang, Yingda Shen, Zhizheng Wu

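Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Shenzhen Transsion Holdings Co., Ltd.

AI summary: To address the difficulty of disentangling timbre from linguistic content in streaming zero-shot voice conversion, the paper uses speaker anonymization as a perturbation mechanism that explicitly mitigates timbre leakage while retaining prosodic utility, enabling a strictly causal, zero-lookahead network.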

Comments Accepted to Interspeech 2026

Abstract

Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing models to explicitly inject features like fundamental frequency. This often requires buffering future frames, creating algorithmic lookahead latency. On the other hand, existing perturbation methods largely overlook the crucial trade-off between timbre leakage and utility preservation. Recognizing this neglected trade-off, we find that the inherent objective of Speaker Anonymization (SA) aligns well with balancing these factors. Thus, we introduce SA as a novel perturbation mechanism to explicitly mitigate timbre leakage while retaining prosodic utility. Crucially, SA's robust representations significantly alleviate the generator's reliance on future context, enabling our strictly causal, zero-lookahead network. Audio samples are available at https://amphionteam.github.io/Zero-VC-demo/.

3. Speech Enhancement, Denoising, and Audio Restoration (1 paper)

2606.19688 2026-06-19 cs.SD eess.AS New submission

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

Yunsik Kim, Yoonyoung Chung

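Affiliations: Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH); Intus Co. Ltd.

AI summary: Proposes LaCo-SENet, which uses asymmetric temporal padding and dual-buffer streaming to trade latency against quality through a single hyperparameter. On VoiceBank+DEMAND, a 1.37M-parameter backbone covers latencies of 12.5-75.0 ms, with PESQ ranging from 3.35 to 3.43.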

Comments 5 pages, 3 figures. Accepted for presentation at Interspeech 2026

Abstract

Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
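
The idea of redistributing past and future context through padding can be shown with a plain one-dimensional convolution. The sketch below is a generic illustration of asymmetric padding, not LaCo-SENet's implementation; the function name and the `future_frames` parameter are hypothetical.

```python
def conv1d_asymmetric(x, kernel, future_frames):
    """1-D convolution whose output length equals the input length.

    future_frames: how many taps look ahead (right padding); the remaining
    kernel_size - 1 - future_frames taps look back (left padding).
    """
    k = len(kernel)
    if not 0 <= future_frames <= k - 1:
        raise ValueError("future_frames must be in [0, kernel_size - 1]")
    left = k - 1 - future_frames
    padded = [0.0] * left + list(x) + [0.0] * future_frames
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(x))]

impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]
causal = conv1d_asymmetric(impulse, kernel, future_frames=0)
one_ahead = conv1d_asymmetric(impulse, kernel, future_frames=1)
```

With `future_frames=0` the impulse at frame 2 affects only frames 2 to 4, so the layer is causal; with `future_frames=1` it already affects frame 1, meaning each output needs one future input frame and the layer adds one frame of algorithmic lookahead.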

4. Audio Event Detection and Scene Understanding (1 paper)

2606.19568 2026-06-19 cs.SD cs.AI New submission

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

Sinclair Gurny, Ryan Quinn

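AI summary: Systematically studies how feature extraction techniques and their parameters affect acoustic gunshot classification. Evaluating ResNet-18 on a dataset of 23,000 gunshot recordings, it finds that the right technique can improve top-1 accuracy by up to 20% and that the right parameters can add up to a further 4.7%.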

Abstract

Acoustic gunshot detection is a problem with applications across civilian public safety, military operations, and wildlife conservation, yet the field lacks a rigorous exploration of feature extraction techniques with a focus on generalization to realistic data. The mixed effectiveness of commercial gunshot detection and classification systems indicates an open problem that is not adequately addressed by the current literature. In this paper, we present a systematic investigation of common feature extraction techniques using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers. We benchmark three feature extraction techniques with 12 total unique parameter sets using ResNet-18. Our results demonstrate that using the correct feature extraction technique can improve top-1 accuracy by up to 20%, and utilizing the correct parameters for a given feature extraction technique can improve that value by up to 4.7%.

5. Multimodal Audio and Audio-Visual Learning (1 paper)

2606.20418 2026-06-19 cs.SD New submission

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

Yu Nakagome, Jaesong Lee, Soo-Whan Chung

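Affiliations: LINE WORKS Corporation; NAVER Cloud Corporation

AI summary: Proposes MixProLAP, a probabilistic audio-language pretraining framework that mixes audio-text pairs to simulate overlapping sounds, models many-to-many correspondence uncertainty, and introduces a multi-level inclusion loss; it outperforms deterministic baselines on audio-text retrieval.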

Comments Accepted to Interspeech 2026

Abstract

Acoustic environments often contain multiple overlapping sound events, and the same acoustic scene can be described using diverse textual expressions, making audio-text alignment inherently ambiguous. This paper proposes a probabilistic audio-language pretraining framework to model many-to-many correspondence ambiguity in audio-text alignment. Unlike conventional contrastive methods that learn deterministic point embeddings, our approach represents each modality as a distribution and learns uncertainty-aware cross-modal alignment. Rather than relying on masking-based uncertainty simulation, we mix audio-text pairs to create overlapping sounds that better reflect real acoustic mixtures and capture semantic inclusion relations among sound events. We further introduce a multi-level inclusion loss to enforce representations consistent with these relations. Experiments on audio-text retrieval benchmarks show that the proposed method outperforms deterministic baselines.

6. Datasets, Benchmarks, and Evaluation (2 papers)

2606.19597 2026-06-19 cs.SD cs.AI cs.LG New submission

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

Junyi Fan, Donald S. Williamson

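Affiliations: Department of Computer Science and Engineering, The Ohio State University, USA

AI summary: Proposes PrefSQA, a model that combines uncertainty-aware logits, an impairment attention head, and a non-matching-reference comparison module, and shows that high-quality preference datasets improve the accuracy of speech quality assessment.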

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.

2606.19987 2026-06-19 cs.SD eess.AS New submission

PolSeT: Polish Semantics of Timbre Dataset

Jan Jasiński

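AI summary: Introduces the PolSeT dataset, which collects Polish semantic descriptors and timbre ratings through free-verbalization and semantic differential experiments, filling a gap in open timbre research data and supporting cross-cultural psychoacoustics and MIR research.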

Comments 8 pages, 7 figures. Data descriptor for the PolSeT dataset (Polish Semantics of Timbre), available at https://doi.org/10.5281/zenodo.17830609 under CC BY 4.0

Abstract

This data report introduces PolSeT (Polish Semantic Timbre), a dataset designed to facilitate research in psychoacoustics and Music Information Retrieval (MIR) in Polish and cross-cultural contexts. The dataset contains data from two sequential experiments. Experiment 1 (N=60) was a free-verbalization task aimed at creating a lexicon of Polish semantic descriptors. Using 11 stimuli, a total of 1901 descriptors (701 unique) were gathered. Experiment 2 (N=105) utilized this lexicon to conduct a semantic differential study, where participants rated 18 instrument sounds on 8 bipolar scales, with repeated trials for reliability analysis. The released dataset includes raw listener responses, comprehensive demographics (experience, gender, age), audio stimuli, and extracted acoustic features with Python extraction code. This dataset addresses a gap in open timbre research data, providing both the qualitative linguistic groundwork and the quantitative ratings necessary for psychoacoustic research and the training of multilingual semantic embedding models.

7. Security, Privacy, and Deepfake Audio (1 paper)

2606.19579 2026-06-19 cs.SD cs.AI New submission

FlowFake: Liquid Networks for Audio Deepfake Detection

Shivaay Dhondiyal, Divyansh Sharma, Dinesh Kumar Vishwakarma

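Affiliations: Delhi Technological University

AI summary: To address the failure of audio deepfake detectors to generalize across datasets, the paper proposes FlowFake, a Liquid Time-Constant (LTC) model whose hidden state evolves via a learned ODE with adaptive time constants; with 34K parameters it outperforms RawGAT-ST and Whisper-DF on a cross-domain benchmark.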

Comments Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026

Abstract

Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure stems primarily from structural synthetic speech artifacts, which are multi-timescale trajectory anomalies. Every existing detector, however, aggregates fixed-window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters, FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross-domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 when trained only on FakeOrReal and 79.97% when trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at https://github.com/GhostRider2023/FlowFake
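
A single liquid time-constant neuron can be sketched to show how an input-dependent gate changes the effective time constant. The example assumes the usual LTC form dx/dt = -(1/tau + f) * x + f * a with a sigmoid gate f, and integrates it with classical fourth-order Runge-Kutta, one solver whose global error is O(dt^4). The abstract does not state FlowFake's exact dynamics or solver, so the equations, parameter values, and names are all assumptions.

```python
import math

def ltc_derivative(x, inp, tau, w, b, a):
    """dx/dt = -(1/tau + f) * x + f * a, with gate f = sigmoid(w * inp + b)."""
    f = 1.0 / (1.0 + math.exp(-(w * inp + b)))
    return -(1.0 / tau + f) * x + f * a

def rk4_step(x, inp, dt, **params):
    """One classical Runge-Kutta step; the input is held constant over it."""
    k1 = ltc_derivative(x, inp, **params)
    k2 = ltc_derivative(x + 0.5 * dt * k1, inp, **params)
    k3 = ltc_derivative(x + 0.5 * dt * k2, inp, **params)
    k4 = ltc_derivative(x + dt * k3, inp, **params)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def effective_time_constant(inp, tau, w, b):
    """tau / (1 + tau * f): a stronger gate means faster dynamics."""
    f = 1.0 / (1.0 + math.exp(-(w * inp + b)))
    return tau / (1.0 + tau * f)

params = dict(tau=1.0, w=4.0, b=0.0, a=1.0)  # illustrative values only
x = 0.0
for _ in range(200):  # 200 steps of 10 ms cover 2 s
    x = rk4_step(x, inp=1.0, dt=0.01, **params)
fast = effective_time_constant(1.0, tau=1.0, w=4.0, b=0.0)
slow = effective_time_constant(-1.0, tau=1.0, w=4.0, b=0.0)
```

Because the gate lies between 0 and 1, the state relaxes toward a bounded value, and the same neuron responds faster or slower depending on its input, which is the mechanism the abstract credits for covering both short spectral and long prosodic cues.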