arXivDaily: arXiv daily academic digest, updated Monday through Friday

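1. Speech Recognition and Keyword Spotting (2 papers)

2606.13253 2026-06-12 cs.SD cs.AI New submission

Towards Personalized Federated Learning for Dysarthric Speech Recognition

Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu

Affiliations: The Chinese University of Hong Kong; National Research Council Canada

AI summary: To address heterogeneity in federated learning for dysarthric speech recognition, the paper proposes two personalized aggregation strategies, parameter averaging and embedding averaging, which reduce absolute word error rate by up to 0.99% on UASpeech and 0.56% on TORGO.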

Abstract

Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.
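
The baseline named in the abstract, FedAvg, aggregates client models by averaging their parameters, weighted by how much data each client holds. A minimal sketch of that plain aggregation step follows. The parameter name `w`, the toy values, and the list-of-floats representation are illustrative assumptions, and neither the regularization term nor the paper's two personalized strategies are shown, since the abstract does not specify them.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average each parameter across clients, weighted by local dataset size."""
    total = sum(client_sizes)
    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical speakers (clients); the second holds three times more data.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
sizes = [1, 3]
global_params = fedavg(clients, sizes)
```

Every speaker receives the same `global_params` after this step, which is the sharing that the abstract describes as potentially suboptimal under speaker heterogeneity.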

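2606.10231 2026-06-12 eess.AS cs.SD Updated version

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Ruchao Fan, Yiming Wang, Yuxuan Hu, Bo Ren, Yufei Xia, Xiaofei Wang, Yao Qian, Shujie Liu, Jinyu Li

AI summary: Proposes Mel-LLM, an architecture with no dedicated speech encoder that feeds Mel spectrogram patches directly into an LLM through a linear projection. Experiments on ASR and TTS support its feasibility: ASR performance is comparable to encoder-based approaches, and TTS is shown to be feasible in preliminary results.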

Abstract

Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrogram directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
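
The input path described in the abstract (Mel spectrogram patches passed through a linear projection) can be sketched roughly as below. This is only an illustration under assumptions: the patch length, the toy dimensions, and the helper names `patchify` and `linear_project` are hypothetical, and the abstract does not say how Mel-LLM actually forms or pre-processes its patches.

```python
from typing import List

Matrix = List[List[float]]


def patchify(mel: Matrix, frames_per_patch: int) -> Matrix:
    """Flatten each run of consecutive frames into one patch vector.

    Trailing frames that do not fill a whole patch are dropped.
    """
    n_patches = len(mel) // frames_per_patch
    patches = []
    for k in range(n_patches):
        chunk = mel[k * frames_per_patch:(k + 1) * frames_per_patch]
        patches.append([value for frame in chunk for value in frame])
    return patches


def linear_project(patches: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Map each patch to the model width: out = weight @ patch + bias."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Toy input: 4 frames x 2 Mel bins (real features would have many more bins).
mel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
patches = patchify(mel, frames_per_patch=2)  # 2 patches, each of length 4
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # hypothetical 2 x 4 projection
bias = [0.0, 0.0]
embeddings = linear_project(patches, weight, bias)
```

Each row of `embeddings` would then stand in for one input position of the LLM, in place of the output of a speech encoder.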

2. Speech Synthesis and Sound Generation (3 papers)

2606.12555 2026-06-12 cs.SD cs.CV cs.MM New submission

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue, Yujiu Yang, Weijia Chen, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; Tsinghua University; Noiz AI; Independent Researcher

AI summary: Proposes AudioX-Turbo, a unified and efficient framework built on a teacher-student paradigm. It uses a multimodal diffusion Transformer and distribution matching distillation to generate audio from text, video, and audio inputs with only 4 sampling steps, cutting the number of function evaluations (NFE) by about 25x.

Abstract

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. As such, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals) in this work. AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

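2606.12940 2026-06-12 cs.SD cs.LG New submission

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes self-guidance, which aligns the decoder's internal manifolds with a lightweight feature-mapping loss. It improves the reconstruction quality of VQ-VAE neural speech codecs without changing inference, reaches state-of-the-art low-bitrate performance, and supports a 4x codebook reduction.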

Comments 20 pages, 9 figures, accepted to ICML 2026, demo website available at https://sgvqvae.github.io/sgvqvae-demo

Abstract

Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.
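
The core idea, comparing the decoder's internal features for a quantized token against those for its original continuous embedding, can be illustrated with a toy sketch. Everything concrete here is an assumption: the mean-squared-error form of the loss, the affine-plus-ReLU "decoder layers", and the helper names are hypothetical stand-ins, since the abstract only states that a lightweight feature-mapping loss is used.

```python
from typing import List, Tuple

Vector = List[float]


def nearest_code(z: Vector, codebook: List[Vector]) -> Vector:
    """Vector quantization: pick the codebook entry closest to z."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))


def decoder_features(x: Vector, layers: List[Tuple[float, float]]) -> List[Vector]:
    """Run a toy decoder (elementwise affine + ReLU) and keep every layer's output."""
    features = []
    h = x
    for scale, shift in layers:
        h = [max(0.0, scale * v + shift) for v in h]
        features.append(h)
    return features


def feature_alignment_loss(
    z: Vector, codebook: List[Vector], layers: List[Tuple[float, float]]
) -> float:
    """Mean over layers of the MSE between quantized-path and continuous-path features."""
    quantized = decoder_features(nearest_code(z, codebook), layers)
    continuous = decoder_features(z, layers)
    per_layer = [
        sum((q - c) ** 2 for q, c in zip(fq, fc)) / len(fq)
        for fq, fc in zip(quantized, continuous)
    ]
    return sum(per_layer) / len(per_layer)


codebook = [[0.0, 1.0], [2.0, 2.0]]
layers = [(2.0, 0.0)]  # a single toy layer: h = relu(2 * x)
loss = feature_alignment_loss([0.5, 1.5], codebook, layers)
```

Because the loss only touches training, the decoder used at inference time is unchanged, which matches the abstract's claim of no inference-time changes.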

2606.13006 2026-06-12 cs.SD New submission

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Yihang Lin, Li Zhou, Congwei Cao, Dongchu Xie, Xiaoxue Gao, Chen Zhang, Haizhou Li

Affiliations: The Chinese University of Hong Kong, Shenzhen; Agency for Science, Technology and Research; National University of Singapore; Shenzhen Research Institute of Big Data; Shenzhen Loop Area Institute

AI summary: Proposes the Emo-LiPO framework, which casts emotion intensity control as a learning-to-rank problem and uses listwise preference optimization to align emotion intensity between text and speech. This yields more faithful and continuous emotional expression and significantly improves emotion accuracy and intensity controllability on the ESD-plus dataset.

Comments Accepted by IJCAI 2026. Emotional TTS, Preference Optimization, Emotion Intensity Control

Abstract

Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
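
To make the learning-to-rank framing concrete, here is a generic listwise ranking loss: the Plackett-Luce negative log-likelihood, as used in ListMLE. It is a stand-in chosen for illustration, not the Emo-LiPO objective, which the abstract does not spell out. The scores are hypothetical model scores for utterances of one transcript, listed from the highest target intensity to the lowest.

```python
import math
from typing import List


def listwise_nll(scores_in_target_order: List[float]) -> float:
    """Plackett-Luce negative log-likelihood of the target ordering.

    scores_in_target_order[0] is the score of the item that should rank first
    (highest emotion intensity), and so on down the list.
    """
    loss = 0.0
    for i in range(len(scores_in_target_order)):
        tail = scores_in_target_order[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= scores_in_target_order[i] - log_z
    return loss


# Scores that agree with the target ordering give a lower loss than reversed ones.
agreeing = listwise_nll([3.0, 2.0, 1.0])
disagreeing = listwise_nll([1.0, 2.0, 3.0])
```

Unlike a pairwise objective such as DPO, a loss of this shape scores the whole ordering at once, which is the sense in which the abstract speaks of modeling global intensity ordering.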

3. Speaker Identification, Verification, and Separation (2 papers)

2606.12495 2026-06-12 cs.SD New submission

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

Affiliations: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province

AI summary: Proposes the MRAF framework, which uses a learnable missing token and reliability-aware cross-attention fusion to address cross-lingual generalization and robustness to missing faces in polyglot scenarios, achieving high accuracy on the POLY-SIM 2026 test set.

Comments 8 pages, 3 figures, 4 tables

Abstract

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.
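
Two mechanisms from the abstract lend themselves to a small sketch: substituting a missing token for an unavailable face input, and turning reliability scores into modality weights that scale the tokens before fusion. The softmax normalization, the fixed toy values, and the function names below are assumptions for illustration; in MRAF the missing token is learned and the scores are estimated by the model, and the cross-attention itself is not shown.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def modality_weights(face_score: float, audio_score: float) -> Tuple[float, float]:
    """Normalize two reliability scores into weights (softmax is an assumption)."""
    m = max(face_score, audio_score)
    e_face = math.exp(face_score - m)
    e_audio = math.exp(audio_score - m)
    total = e_face + e_audio
    return e_face / total, e_audio / total


def prepare_tokens(
    face: Optional[Vector],
    audio: Vector,
    missing_token: Vector,
    face_score: float,
    audio_score: float,
) -> Tuple[Vector, Vector]:
    """Replace an absent face input with the missing token, then apply the weights."""
    face_token = missing_token if face is None else face
    w_face, w_audio = modality_weights(face_score, audio_score)
    return [w_face * v for v in face_token], [w_audio * v for v in audio]


# Hypothetical missing-face case with equal reliability scores.
face_out, audio_out = prepare_tokens(
    face=None,
    audio=[1.0, 2.0],
    missing_token=[0.5, -0.5],
    face_score=0.0,
    audio_score=0.0,
)
```

A lower face score would shrink `face_out` relative to `audio_out`, which is how the weighting can suppress an unreliable modality before cross-attention.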

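2606.13095 2026-06-12 eess.AS cs.SD Cross-listing

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu

AI summary: Proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive-threshold ASR loss strategy to train an LLM-based system efficiently with limited real data while balancing the ASR and diarization tasks, with relative improvements of 18% on AliMeeting and 24% on Aishell4.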

Comments Accepted in Interspeech 2026

Abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.

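4. Speech Enhancement, Denoising, and Audio Restoration (2 papers)

2606.12662 2026-06-12 cs.SD cs.AI cs.LG New submission

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Damien Martins Gomes, François Capman

Affiliations: Thales SIX GTS, France

AI summary: Proposes BASENet, which partitions the spectrum into Bark-scale bands, assigns each band an encoder with adapted capacity, and adds a cross-band attention module. It reaches high PESQ and STOI with the fewest parameters among methods above 3.50 PESQ, making it suitable for resource-constrained devices.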

Abstract

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 G MACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
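
The Bark-scale partition can be illustrated with Traunmüller's (1990) closed-form approximation of the Bark scale, used here without its low- and high-frequency end corrections. This is a sketch of the general idea only: the 8 kHz upper limit and the choice of one band per integer Bark are assumptions, and the abstract does not state how BASENet actually places its band edges.

```python
from typing import List


def hz_to_bark(f_hz: float) -> float:
    """Traunmüller (1990) approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_band_edges(max_hz: float) -> List[float]:
    """Band edges in Hz at every integer Bark value from 0 up to max_hz."""
    n_bands = int(hz_to_bark(max_hz))
    return [bark_to_hz(z) for z in range(n_bands + 1)]


edges = bark_band_edges(8000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The bands are narrow at low frequencies and widen toward high frequencies, so a fixed frequency range contains many more critical bands at the low end. That density is what the abstract refers to when it gives deeper branches to low frequencies.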

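2606.13109 2026-06-12 eess.AS cs.SD Cross-listing

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

AI summary: Proposes Close-to-Distant microphone projection (C2D projection), which generates paired data from real recordings. The projection is realized with a Parametric Multichannel Wiener Filter, and neural networks trained on the resulting data outperform the existing GSS method for distant speech enhancement.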

Journal ref: Proceedings of IEEE ICASSP 2026

Abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.

5. Audio Event Detection and Scene Understanding (3 papers)

2606.12503 2026-06-12 cs.LG cs.SD Cross-listing

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

Affiliations: École Normale Supérieure, Paris, France; Not Diamond, San Francisco, USA; Institut du Cerveau, Paris, France; Champalimaud Foundation, Lisbon, Portugal

AI summary: Proposes Dolph2Vec, the first self-supervised model trained on five years of longitudinal dolphin recordings. It significantly outperforms general-purpose baselines on signature whistle classification and whistle detection and uncovers interpretable acoustic units.

Abstract

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

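2606.13236 2026-06-12 cs.LG cs.AI cs.SD stat.AP Cross-listing

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

Affiliations: University of Oxford

AI summary: Proposes PULSE, a semi-supervised multi-task framework that combines weakly supervised classification, self-supervised learning, and knowledge distillation. It outperforms a general-purpose model on Orthoptera bioacoustic classification, and active learning improves performance further.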

Comments ICML 2026 Workshop on Machine Learning for Audio

Abstract

Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.
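
The headline metric here is macro F1, which averages the per-class F1 scores so that rare species count as much as common ones. A minimal single-label sketch follows; the toy labels are made up, and the paper's weakly supervised setting may well be multi-label, which this simplified version does not cover.

```python
from typing import List


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced case: a classifier that always predicts the common class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the macro F1 is below 0.5, because the rare class contributes an F1 of zero. This helps explain why macro F1 values that look low in absolute terms, such as those quoted in the abstract, can still separate models on long-tailed species data.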

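2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS Updated version

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

Affiliations: University of California, San Diego

AI summary: Proposes the GetNetUPAM framework, which preserves ecological heterogeneity through hierarchical nested cross-validation, together with ARPA-N, a network that integrates CBAM spatial attention. The model generalizes robustly under high-noise, low-SNR conditions and reduces false positives by about 10x in a region with no training data.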

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

Abstract

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
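
The evaluation protocol, outer folds that each hold out one site-year block and inner stratified folds drawn only from the remaining data, can be sketched as follows. The record layout `(site, year, label)`, the round-robin stratification, and the function names are assumptions for illustration; the abstract does not describe how GetNetUPAM implements its splits.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, int, int]  # (site, year, label): a hypothetical record layout
Block = Tuple[str, int]


def outer_folds(samples: List[Sample]) -> Iterator[Tuple[Block, List[int], List[int]]]:
    """Yield (block, train_idx, test_idx), holding out one site-year block per fold."""
    blocks: Dict[Block, List[int]] = defaultdict(list)
    for i, (site, year, _label) in enumerate(samples):
        blocks[(site, year)].append(i)
    for block, test_idx in sorted(blocks.items()):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        yield block, train_idx, test_idx


def inner_stratified_folds(
    samples: List[Sample], train_idx: List[int], k: int
) -> List[List[int]]:
    """Split outer-training indices into k folds, dealing each label round-robin."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for i in train_idx:
        by_label[samples[i][2]].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_label):
        for j, i in enumerate(by_label[label]):
            folds[j % k].append(i)
    return folds


samples = [
    ("A", 2019, 0), ("A", 2019, 1),
    ("A", 2020, 0), ("A", 2020, 1),
    ("B", 2019, 0), ("B", 2019, 1),
]
outer = list(outer_folds(samples))
first_block, first_train, first_test = outer[0]
inner = inner_stratified_folds(samples, first_train, k=2)
```

Because the inner folds are built only from `train_idx`, nothing from the held-out site-year block can leak into model selection, which is the separation between development and deployment conditions that the abstract emphasizes.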

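6. Music Information Retrieval and Music Generation (1 paper)

2606.13640 2026-06-12 cs.SD New submission

The Moving Drone: Negotiating Agency Between the Voice and the Virtual

Nithya Shikarpur, Victor Arul, Anna Huang

Affiliations: Massachusetts Institute of Technology; Harvard University

AI summary: Rooted in Hindustani music, the work uses Max/MSP loopers and the generative AI model GaMaDHaNi to turn the traditionally static drone into a dynamic, proactive virtual musical agent, exploring agency in human-machine collaboration.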

Comments Published in NIME music track 2026

Abstract

Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion that, throughout the performance, gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP to function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically by pitch shifting the loops, which introduces a dimension of sudden, explicit movement. Then it changes timbrally, via the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, to resynthesize looped audio. While current music AI approaches prioritize high fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally utilizes low-fidelity generative outputs, further necessitating human interpretation and situational context in order to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.

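7. Speech Translation and Speech Language Models (2 papers)

2606.13121 2026-06-12 cs.CL cs.AI cs.SD Cross-listing

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

Affiliations: IPAI and ECE, Seoul National University; Department of AI, University of Seoul

AI summary: Proposes a fluency-aware optimization framework that minimizes inter-chunk silences by using model-internal signals, such as linguistic diversity and temporal variability in speech durations, to find the sweet spot between the low latency of simultaneous translation and the natural flow of consecutive translation.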

Comments Proceedings of the 26th Interspeech Conference, Long Paper

Abstract

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

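2606.13450 2026-06-12 eess.AS cs.SD Cross-listing

Endpoint Anticipation for Low-Latency Spoken Dialogue

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

AI summary: Proposes endpoint anticipation, which achieves low latency by forecasting end-of-turn signals in advance and speculatively executing the LLM and TTS pipelines on partial context, reducing average latency by 505 ms.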

Comments Accepted at Interspeech 2026

Abstract

While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.

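8. Low-Resource, Multilingual, and Dialectal Speech (1 paper)

2606.11681 2026-06-12 cs.CL cs.SD Updated version

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

Affiliations: Dept. of Electronics and Electrical Engineering, Yonsei University

AI summary: Proposes UR-BERT, a TTS encoder based on Romanized transcriptions. By unifying writing systems into a shared Romanized representation and adding a speech token prediction objective, it enables efficient multilingual TTS across 495 languages, outperforms existing baselines, and generalizes to unseen languages.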

Comments Accepted to Interspeech 2026, Github: https://github.com/sanghyang00/ur-bert

Details
Abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the limited availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

9. Datasets, benchmarks, and evaluation (1 paper)

2603.00610 2026-06-12 cs.SD cs.AI cs.LG cs.MM eess.AS replacement

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

Affiliations * National University of Singapore; University of Science and Technology of China; University of Cambridge; University of Toronto

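AI summary To address the lag in evaluation mechanisms for music generation models, proposes the CMI-RewardBench benchmark, accompanied by large-scale preference datasets and a family of parameter-efficient reward models, for evaluating generated music under compositional multimodal instructions.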

Comments Accepted by ICML 2026

Details
Abstract

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref).
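The inference-time scaling via top-k filtering mentioned in the abstract can be pictured with a minimal sketch: generate several candidates, score each with a reward model, and keep the k best. The function `top_k_filter`, the candidate names, and the toy score table below are all hypothetical stand-ins; the abstract does not describe the CMI-RM interface, so nothing here reflects the authors' actual code.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def top_k_filter(candidates: List[T], score_fn: Callable[[T], float], k: int) -> List[T]:
    """Return the k candidates with the highest reward, best first.

    score_fn is a hypothetical stand-in for a reward model call.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:k]


# Toy example: made-up reward scores for four generated clips.
toy_scores = {"clip_a": 0.31, "clip_b": 0.87, "clip_c": 0.55, "clip_d": 0.12}
kept = top_k_filter(list(toy_scores), toy_scores.__getitem__, k=2)
```

In this toy run `kept` holds the two highest-scoring clip names; a real pipeline would replace the lookup table with a reward model scoring each generated audio sample against its instruction.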

10. Security, privacy, and deepfake audio (1 paper)

2606.12812 2026-06-12 cs.CY cs.SD cross-list

Vocal Identity Under Siege by AI Voice Cloning Technologies

Jyh-An Lee, Xuan Sun

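AI summary Through a comparative analysis of three legal frameworks (the right of publicity, personality rights, and the personal data protection right), examines how generative AI voice cloning threatens the unique value of vocal identity and how the law can respond.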

Details
Journal ref
[2026] Singapore Journal of Legal Studies 46
Abstract

The advent of sophisticated AI-driven voice cloning has brought to the fore critical legal and ethical challenges regarding the protection of vocal identity. Prompted by recent controversies - including the striking resemblance between OpenAI's ChatGPT-4o voice and that of Scarlett Johansson - this article examines how generative AI technologies undermine the unique value of the human voice and further complicate the legal questions surrounding personality rights. Through a comparative analysis, the paper evaluates three principal legal frameworks: the right of publicity, personality rights, and the personal data protection right. Each framework - rooted in different legal traditions - offers distinct strengths and limitations in addressing the threats posed by AI-generated voice cloning. By analysing these doctrines' scope, remedies, and posthumous protections, the study offers a foundation for understanding how existing legal approaches may be applied to the evolving challenges of vocal identity in the era of generative AI.

11. Other / general speech and audio (3 papers)

2606.13193 2026-06-12 eess.AS cs.PL cs.SD cross-list

A Dual-Mode Faust-to-CLAP Compilation System

Facundo Franchino, Stéphane Letz, Jatin Chowdhury

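AI summary Proposes the faust2clap framework, which supports a static (ahead-of-time compiled) mode and a dynamic (runtime-interpreted) mode, and uses an address-based identity matching algorithm with a stable slot allocation scheme to preserve DSP parameter identity, enabling efficient compilation and hot reloading.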

Comments 4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026

Details
Abstract

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
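The parameter identity problem can be illustrated with a small sketch of what address-based matching with stable slots might look like. This is a guess at the general idea, not the faust2clap algorithm: the function `remap_parameters`, the dictionary layout, the example addresses, and the policy of never reusing a retired slot are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def remap_parameters(
    old_slots: Dict[str, int],
    old_values: Dict[str, float],
    new_defaults: Dict[str, float],
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Carry parameter slots and values across a DSP reload, keyed by address.

    old_slots:    address -> slot index the host already knows (hypothetical layout)
    old_values:   address -> value before the reload
    new_defaults: address -> default value declared by the reloaded DSP
    """
    new_slots: Dict[str, int] = {}
    new_values: Dict[str, float] = {}
    # Assumption: fresh slots are numbered after every slot used so far,
    # so a slot index never changes meaning from the host's point of view.
    next_slot = max(old_slots.values(), default=-1) + 1
    for address, default in new_defaults.items():
        if address in old_slots:
            # Same address as before: keep the slot and the live value.
            new_slots[address] = old_slots[address]
            new_values[address] = old_values.get(address, default)
        else:
            # New address: allocate a fresh slot and start from the default.
            new_slots[address] = next_slot
            new_values[address] = default
            next_slot += 1
    return new_slots, new_values


# Toy reload: "gain" and "freq" survive, "cutoff" is new.
slots, values = remap_parameters(
    {"/synth/freq": 0, "/synth/gain": 1},
    {"/synth/freq": 440.0, "/synth/gain": 0.5},
    {"/synth/gain": 1.0, "/synth/cutoff": 2000.0, "/synth/freq": 220.0},
)
```

Keying on the address rather than on declaration order is what lets a surviving parameter keep its host automation binding even when the edited DSP declares its controls in a different order.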

2606.11836 2026-06-12 cs.SD cs.AI eess.AS replacement

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

Affiliations * The Chinese University of Hong Kong; National Research Council Canada

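AI summary Proposes a data-free, training-free compression method based on channel-wise k-means clustering, with fine-grained mixed-sparsity pruning obtained by varying the number of parameter clusters per layer; on HuBERT-large and Whisper-large-v3 it achieves markedly lower WER than magnitude-based pruning.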

Comments Accepted by Interspeech 2026

Details
Abstract

This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed sparsity pruning by layer-level varying number of parameter clusters is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over the magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
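As a rough illustration of channel-wise parameter clustering, the sketch below runs plain k-means (Lloyd's algorithm) over the rows of a weight matrix and replaces each row with its cluster centroid, using only the weights themselves. It is a generic sketch under stated assumptions, not the paper's method: treating rows as channels, the random initialization, the fixed iteration count, and the name `cluster_channels` are all choices made here, and the abstract does not say how cluster counts map to the reported sparsity levels.

```python
import numpy as np


def cluster_channels(weight, k, n_iter=20, seed=0):
    """Replace each row (channel) of `weight` with its k-means centroid.

    Uses only the weight matrix: no input data and no gradient updates.
    Returns (compressed_weight, labels).
    """
    weight = np.asarray(weight, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct rows of the matrix.
    init = rng.choice(weight.shape[0], size=k, replace=False)
    centroids = weight[init].copy()

    def assign(cents):
        # Squared Euclidean distance from every row to every centroid.
        dists = ((weight[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            members = weight[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)

    labels = assign(centroids)
    return centroids[labels], labels


# Toy example: 16 channels of width 8 reduced to at most 4 distinct rows.
W = np.random.default_rng(1).normal(size=(16, 8))
W_compressed, labels = cluster_channels(W, k=4)
```

After clustering, only the k centroids and one label per channel need to be stored, which is where the compression would come from; varying k from layer to layer is one way to read the abstract's "layer-level varying number of parameter clusters".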

2302.01090 2026-06-12 cs.SD cs.IR eess.AS replacement

Goniometers are a Powerful Acoustic Feature for Music Information Retrieval Tasks

Tim Ziemer

Affiliations * University of Hamburg

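AI summary Explores goniometers as an acoustic feature for music information retrieval, using self-organizing maps to show their usefulness for genre classification and album clustering, and highlights causality as their advantage.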

Details
Journal ref
Fortschritte der Akustik (DAGA) 2023
Abstract

Goniometers, also known as Phase Scopes or Vector Scopes, are audio metering tools that help music producers and mixing engineers monitor spatial aspects of a music mix, such as the stereo panorama, the width of single sources, the amount and diffuseness of reverberation as well as phase cancellations that may occur on the sweet-spot and in a mono-mixdown. In addition, they implicitly inform about the dynamics of the sound. Self-organizing maps trained with a goniometer are consulted to explore the usefulness of this acoustic feature for music information retrieval tasks. One can see that goniometers are able to classify different genres and cluster a single album. The advantage of goniometers is the causality: Music producers and mixing engineers consciously consult goniometers to reach their desired sound, which is not the case for other acoustic features, from Zero-Crossing Rate to Mel-Frequency Cepstral Coefficients.
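For readers unfamiliar with the display, a goniometer plots each stereo sample pair rotated by 45 degrees, so that in-phase content lies along one axis and out-of-phase content along the other. The sketch below shows that rotation and, as one hypothetical way to turn the display into a fixed-size feature, a 2D histogram of the points. The function names, the sign convention of the horizontal axis, the bin count, and the histogram step are assumptions; the abstract does not specify how the feature fed to the self-organizing maps is computed.

```python
import numpy as np


def goniometer_points(left, right):
    """Rotate (left, right) sample pairs by 45 degrees into (side, mid) coordinates."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mid = (left + right) / np.sqrt(2.0)   # vertical axis: in-phase (mono) content
    side = (left - right) / np.sqrt(2.0)  # horizontal axis: out-of-phase content
    return side, mid


def goniometer_histogram(left, right, bins=32):
    """Hypothetical feature: 2D histogram of goniometer points for samples in [-1, 1]."""
    side, mid = goniometer_points(left, right)
    hist, _, _ = np.histogram2d(side, mid, bins=bins, range=[[-1.5, 1.5], [-1.5, 1.5]])
    return hist


t = np.linspace(0.0, 1.0, 1000, endpoint=False)
tone = np.sin(2 * np.pi * 5 * t)
side_mono, mid_mono = goniometer_points(tone, tone)    # identical channels
side_anti, mid_anti = goniometer_points(tone, -tone)   # polarity-inverted right channel
feature = goniometer_histogram(tone, 0.5 * tone)
```

A mono signal collapses onto the vertical axis (the side component is zero everywhere), while a polarity-inverted pair collapses onto the horizontal axis, which is the pattern engineers read as a warning of cancellation in a mono mixdown.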