arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting 2 papers

2606.14647 2026-06-15 cs.SD cs.AI New submission

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

Affiliations Florida International University; University of South Florida

AI summary Proposes LEAF-X, a framework that combines entropy-guided attention weighting, multi-layer attention rollout, and causal ablation to produce sparse frame-level attributions for Transformer speech recognition models, improving faithfulness by 32% and locality/sparsity by 35-39%.

Comments 17 pages, 3 figures, and 9 tables. Accepted in Interspeech 2026 conference

Abstract

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.
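Illustrative sketch (not the authors' code): one way entropy-guided head weighting and attention rollout could be combined to get token-to-frame attributions. The function names, the softmax-over-negative-entropy weighting, the `temperature` parameter, and the 0.5/0.5 residual mixing in the rollout are all assumptions made for this example; the paper's actual formulation may differ.

```python
import numpy as np

def head_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (nats) of each head's attention rows.

    attn has shape (heads, queries, keys); each row sums to 1.
    """
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean(axis=-1)

def entropy_weights(attn, temperature=1.0):
    """Softmax over negative entropies, so focused heads get more weight."""
    scores = -head_entropy(attn) / temperature
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def fuse_heads(attn, temperature=1.0):
    """Entropy-weighted average of the heads, shape (queries, keys)."""
    return np.tensordot(entropy_weights(attn, temperature), attn, axes=1)

def rollout(layers):
    """Attention rollout over square self-attention maps.

    Each map is mixed with the identity (a stand-in for the residual
    connection), renormalised, and multiplied into the running product.
    """
    n = layers[0].shape[-1]
    result = np.eye(n)
    for a in layers:
        mixed = 0.5 * a + 0.5 * np.eye(n)
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        result = mixed @ result
    return result

# Toy input: 2 decoder tokens, 4 encoder frames, 2 cross-attention heads.
sharp_head = np.array([[0.97, 0.01, 0.01, 0.01],
                       [0.01, 0.97, 0.01, 0.01]])
diffuse_head = np.full((2, 4), 0.25)
cross_attention = np.stack([sharp_head, diffuse_head])
encoder_layers = [np.full((4, 4), 0.25), np.eye(4)]

# Token-to-frame attribution: fused cross-attention pushed through the rollout.
attribution = fuse_heads(cross_attention) @ rollout(encoder_layers)
```

In this toy case the sharp head receives the larger weight, and every attribution row remains a distribution over frames because each factor is row-stochastic.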

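2606.14391 2026-06-15 cs.CL cs.AI cs.SD Cross-listing

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

Affiliations Knowledge Technology, Department of Informatics, University of Hamburg

AI summary To address the information loss caused by ASR systems that ignore disfluencies, proposes continual learning with explicit disfluency tokens: the tokens are introduced into a pretrained model, which is then trained further. Analyzes the trade-off between marker learning and ASR performance, and a cross-attention head mechanism shared across methods.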

Comments Accepted at Interspeech 2026

Abstract

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

2. Speech Synthesis and Sound Generation 3 papers

2606.13989 2026-06-15 cs.SD cs.AI New submission

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal, Frederico Santos de Oliveira, Christopher Dane Shulby, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

Affiliations Federal University of Goiás; Federal University of Uberlândia; University of São Paulo; University of Brasília; University of California, Berkeley

AI summary Proposes the Mask, Sample, Revise inference stack, which combines predictor-free guidance, prompt-matched conditional coupling, and a schedule-constrained remasking mechanism to improve the robustness and intelligibility of discrete flow matching TTS at low step counts.

Abstract

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.

2606.14049 2026-06-15 cs.SD cs.CV New submission

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

Affiliations Academy for Advanced Interdisciplinary Studies, Nankai University; Kling Team, Kuaishou Technology

AI summary Proposes FoleyGenEx, a unified framework that uses conditional injection, multi-modal dynamic masking, and adverb-based data augmentation to achieve synchronized video-to-audio synthesis with multi-modal control, frame-level temporal alignment, and fine-grained semantics.

Comments Accepted by INTERSPEECH 2026

Journal ref
INTERSPEECH 2026
Abstract

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

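2606.14324 2026-06-15 cs.SD New submission

Instantaneous Pitch Estimation via Wave-U-Net-Based Fundamental Waveform Enhancement

Junya Koguchi, Tomoki Koriyama

Affiliations CyberAgent, Japan

AI summary Treats fundamental waveform filtering as a speech enhancement problem: a Wave-U-Net extracts the fundamental waveform, and the instantaneous frequency is then computed from it, giving accurate and robust instantaneous pitch estimation.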

Comments Accepted to Interspeech 2026

Abstract

Instantaneous pitch estimation plays an important role in analyzing steep pitch variations such as speech prosody and singing techniques. Conventional approaches estimate instantaneous frequency after isolating the fundamental waveform from signals that contain harmonics and noise, which makes the accuracy sensitive to imperfect fundamental filtering. In this study, we formulate fundamental waveform filtering as a speech enhancement problem. Specifically, we train a Wave-U-Net model to extract a fundamental waveform from an input speech signal. The instantaneous pitch is then obtained by computing the instantaneous frequency from the analytic signal of the estimated fundamental waveform. Experimental results show that the proposed method outperforms conventional deterministic approaches and provides accurate and robust instantaneous pitch estimation across diverse domains, including speech, singing voice, musical instruments, and degraded speech signals.
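Illustrative sketch (not the authors' code): the last step described above, computing instantaneous frequency from the analytic signal, can be written with an FFT-based Hilbert transform. The Wave-U-Net stage is not modeled here. As an assumption, a clean 220 Hz sinusoid stands in for the "estimated fundamental waveform", and the function names and sample rate are chosen only for this example.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative-frequency bins
    and double the positive ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0          # Nyquist bin is kept once
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_frequency(x, sample_rate):
    """Instantaneous frequency in Hz: the time derivative of the
    unwrapped phase of the analytic signal, divided by 2*pi."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * sample_rate / (2.0 * np.pi)

# Stand-in for the extracted fundamental: 0.1 s of a 220 Hz sine at 16 kHz.
sample_rate = 16000
t = np.arange(1600) / sample_rate
fundamental = np.sin(2.0 * np.pi * 220.0 * t)
pitch_track = instantaneous_frequency(fundamental, sample_rate)
```

The result has one value per sample interval, so fast pitch changes are not smoothed by a frame window. With harmonics or noise left in the input, the phase derivative becomes erratic, which is why the quality of the fundamental extraction matters.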

3. Speaker Identification, Verification, and Separation 4 papers

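2606.13712 2026-06-15 cs.SD cs.CL New submission

Multimodal Speaker Identification in Classroom Environments

Michael L. Chrzan, Meghavarshini Krishnaswamy, Robert Gibboni, Katie Wetstone, Wei Ai, Jing Liu

Affiliations University of Michigan; University of Pennsylvania; University of Maryland

AI summary To address the low accuracy of acoustic-only models caused by classroom background noise and variable child speech, proposes a multimodal framework that fuses acoustic embeddings with LLM-derived semantic context, raising student identification accuracy from 39.0% to 50.3%, with 76.9% accuracy on utterances over 5 seconds and 99.3% accuracy in distinguishing teacher from student roles.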

Comments 9 pages, 5 tables, 3 figures

Abstract

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.

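2606.14030 2026-06-15 cs.SD cs.CL New submission

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

Rishit Chatterjee, Tahiya Chowdhury

Affiliations Department of Computer Science, Colby College

AI summary Targeting streaming speaker diarization on resource-constrained hardware, compresses the segmentation model with structured pruning and low-bit quantization and studies the performance trade-offs under different latency budgets; finds that FP16 halves model size at the cost of a 40% relative DER increase.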

Comments 6 pages, 3 figures, preprint

Abstract

Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.
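Illustrative sketch (not the authors' code): why an FP16 cast halves parameter storage, and how a relative DER increase is computed. The random weight matrix is a stand-in for real model parameters, and the DER values passed to `relative_increase` in the usage note are hypothetical numbers chosen to show what a 40% relative increase means; the abstract does not report absolute DERs.

```python
import numpy as np

def relative_increase(baseline, compressed):
    """Relative change of a metric with respect to its baseline value."""
    return (compressed - baseline) / baseline

# Stand-in weights: FP32 uses 4 bytes per parameter, FP16 uses 2.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((256, 256)).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

size_ratio = weights_fp16.nbytes / weights_fp32.nbytes

# Worst-case rounding error introduced by the cast for these weights.
max_cast_error = float(
    np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
)
```

For example, `relative_increase(10.0, 14.0)` evaluates to 0.4: a hypothetical DER moving from 10% to 14% is a 4-point absolute change but a 40% relative one.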

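2606.14321 2026-06-15 cs.SD cs.MM New submission

MaskedFOP: Polyglot Speaker Identification under Missing Visual Modality via Cascaded Graph Label Propagation

Ayoub Elkhouzari, Youssef Iraqi, Loubna Mekouar

Affiliations College of Computing, University Mohammed VI Polytechnic

AI summary Proposes MaskedFOP, a system for closed-set polyglot speaker identification when the face modality is entirely absent at test time and the speech is in an unseen language (Urdu); it combines a modality-dropout dual-head network, complementary embedding averaging, and two-stage cascaded inference, and placed first in the POLY-SIM 2026 challenge.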

Abstract

We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complementary mechanisms. First, a modality-dropout dual-head network built on the Fusion and Orthogonal Projection (FOP) backbone forces the audio branch to develop independent discriminative power via per-sample face masking, ensuring that the audio encoder remains capable when face is absent. Second, two MaskedFOP instances trained on Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) features with different random seeds produce complementary audio embeddings whose element-wise average yields a more robust 512-dimensional representation than any single model. Third, a two-stage cascaded inference procedure first refines multimodal labels through a fused Graph Label Propagation (GLP) pass (Stage 1), then assigns audio-only labels by cosine nearest-centroid (Stage 2), replacing the 70 sparse training prototypes with ~1,500 in-domain test-set centroids from Stage 1. Submitted to the POLY-SIM 2026 Grand Challenge, the system achieves a mean P-accuracy of 0.9989, placing first among all submissions evaluated on the challenge server. An ablation identifies cascaded seeding as the single largest gain (>8 pp on P4/P6). The code is available at https://github.com/Ayoub-Elkhouzari/POLY-SIM2026.
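Illustrative sketch (not the authors' code; their repository is linked above): two pieces of the described pipeline, element-wise averaging of embeddings from two model instances and cosine nearest-centroid assignment. The graph label propagation stage is not modeled. The function names, the L2 normalisation, and the 3-dimensional toy embeddings are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def average_embeddings(emb_a, emb_b):
    """Element-wise average of two models' embeddings for the same clips."""
    return l2_normalize(0.5 * (emb_a + emb_b))

def class_centroids(embeddings, labels, num_classes):
    """One unit-length centroid per class from labelled embeddings."""
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(centroids)

def nearest_centroid(embeddings, centroids):
    """Assign each embedding to the centroid with highest cosine similarity."""
    return (l2_normalize(embeddings) @ centroids.T).argmax(axis=1)

# Toy data: two speakers, two clips each, embeddings from two "seeds".
rng = np.random.default_rng(0)
prototypes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
emb_seed_a = prototypes[labels] + 0.05 * rng.standard_normal((4, 3))
emb_seed_b = prototypes[labels] + 0.05 * rng.standard_normal((4, 3))

fused = average_embeddings(emb_seed_a, emb_seed_b)
centroids = class_centroids(fused, labels, num_classes=2)
predicted = nearest_centroid(fused, centroids)
```

In the system described above, the Stage 2 centroids come from Stage 1 pseudo-labels on test data rather than from ground-truth labels as in this toy example.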

2409.04843 2026-06-15 eess.AS cs.SD Version update

Leveraging Sound Source Trajectories for Universal Sound Separation

Donghang Wu, Xihong Wu, Tianshu Qu

Affiliations National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University

AI summary Proposes a method that exploits the mutual facilitation between sound source localization and separation, using iterative tracking and beamforming to separate moving sound sources precisely.

Comments Published in IEEE Transactions on Audio, Speech and Language Processing(TASLP)

Journal ref
IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 2337-2348, 2025
Abstract

Existing methods utilizing spatial information for sound source separation require prior knowledge of the direction of arrival (DOA) of the source or utilize estimated but imprecise localization results, which impairs the separation performance, especially when the sound sources are moving. In fact, sound source localization and separation are interconnected problems, that is, sound source localization facilitates sound separation while sound separation contributes to refined source localization. This paper proposes a method utilizing the mutual facilitation mechanism between sound source localization and separation for moving sources. The proposed method comprises three stages. The first stage is initial tracking, which tracks each sound source from the audio mixture based on the source signal envelope estimation. These tracking results may lack sufficient accuracy. The second stage involves mutual facilitation: Sound separation is conducted using preliminary sound source tracking results. Subsequently, sound source tracking is performed on the separated signals, thereby refining the tracking precision. The refined trajectories further improve separation performance. This mutual facilitation process can be iterated multiple times. In the third stage, a neural beamformer estimates precise single-channel separation results based on the refined tracking trajectories and multi-channel separation outputs. Simulation experiments conducted under reverberant conditions and with moving sound sources demonstrate that the proposed method can achieve more accurate separation based on refined tracking results.

4. Audio Event Detection and Scene Understanding 1 paper

2606.14141 2026-06-15 cs.SD cs.AI cs.CL New submission

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

Affiliations POSTECH; Sony AI; Sony Group Corporation; Sungkyunkwan University; KAIST

AI summary Proposes ST-AudioLM, which uses a spatio-temporal audio encoder to jointly learn event semantics and source trajectories, improving the semantic-localization trade-off for question answering about dynamic sound sources on the ST-AudioQA benchmark.

Abstract

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

5. Music Information Retrieval and Music Generation 2 papers

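2606.14612 2026-06-15 cs.SD cs.AI eess.AS New submission

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Chen Ying Claude, Zhihan Luo

Affiliations Claude Code / Opus 4.6 API / Fable 5; Independent researcher

AI summary Through computational analysis of the score of Beethoven's "Moonlight Sonata", finds that its three movements correspond to three distinct machine learning architectures and reports four counterintuitive findings, including that musical temperature is governed by throughput and that the lightest movement has the highest dissonance.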

Abstract

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
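Illustrative sketch (not the authors' code): two of the score statistics named above, Shannon entropy and Jensen-Shannon divergence, computed over pitch-class distributions. The note lists are hypothetical toy inputs, not data from the sonata, and the choice of pitch classes (MIDI number modulo 12) as the alphabet is an assumption made for this example.

```python
import math
from collections import Counter

def pitch_class_distribution(midi_pitches):
    """Relative frequency of each of the 12 pitch classes (MIDI mod 12)."""
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: entropy of the mixture minus
    the mean entropy of the two distributions. Ranges from 0 to 1."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy_bits(mixture) - (entropy_bits(p) + entropy_bits(q)) / 2

# Hypothetical toy passages (MIDI note numbers), not taken from the score.
passage_a = [61, 64, 68, 61, 64, 68]   # C#, E, G# repeated
passage_b = [60, 62, 65]               # C, D, F

dist_a = pitch_class_distribution(passage_a)
dist_b = pitch_class_distribution(passage_b)
divergence = js_divergence_bits(dist_a, dist_b)
```

Both statistics depend only on how often each pitch class occurs, not on note order. That is the kind of information the abstract says distributions preserve and sequential ordering destroys.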

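2606.13626 2026-06-15 cs.SD cs.LG Version update

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

Dezhi Yu, Kyuil Lee, Yongkang Huang

Affiliations Stanford University

AI summary Compares autoregressive LSTMs, latent-variable models, and generative adversarial networks for generating Bach-style piano music; finds that the autoregressive LSTM with attention generates the most coherent music, vector quantization mitigates posterior collapse, and the adversarial approach captures local pitch patterns but is difficult to train.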

Comments 11 pages, 13 figures. All authors contributed equally

Abstract

We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

6. Datasets, Benchmarks, and Evaluation 3 papers

2606.14591 2026-06-15 cs.SD cs.AI New submission

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu

Affiliations College of Computer Science and Technology, National University of Defense Technology; Korea Advanced Institute of Science and Technology (KAIST); Shanghai Jiaotong University

AI summary To address redundancy in existing audio-language datasets that degrades post-training, proposes a data construction pipeline based on acoustic-similarity deduplication and builds AudioDER, a reasoning-oriented dataset of about 191k samples that consistently improves LALM performance on multiple audio reasoning benchmarks.

Abstract

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
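Illustrative sketch (not the authors' pipeline): one simple form of acoustic-similarity deduplication is a greedy pass that keeps a clip only if its embedding is not too close, in cosine similarity, to any clip already kept. The embedding source, the 0.95 threshold, and the function name are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate removal.

    embeddings: (num_clips, dim) array of audio embeddings.
    Returns the indices of the clips that are kept, in input order.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(unit):
        # Keep the clip only if it is dissimilar to everything kept so far.
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: clips 0 and 1 are near-duplicates, clip 2 is distinct.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
])
kept_indices = deduplicate(toy_embeddings)
```

This greedy pass compares every clip against all kept clips, so it is quadratic in the worst case. At the scale of a real corpus, an approximate nearest-neighbour index would be the more practical choice.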

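2606.14459 2026-06-15 cs.CL cs.AI cs.SD Cross-listing

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

Affiliations Knowledge Technology, Department of Informatics, University of Hamburg, Germany

AI summary Proposes the MoDiCoL dataset, whose modular design separates linguistic content, speaker characteristics, and acoustic environments, together with a continual learning curriculum that simulates real-world distribution shifts; evaluates how robustness is acquired, transferred, and forgotten under three continual learning strategies.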

Comments Accepted at Interspeech 2026

Abstract

Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

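2511.07075 2026-06-15 cs.SD Version update

Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Mayank Mishra, Paul Magron, Romain Serizel

Affiliations University of Cambridge; Inria

AI summary For evaluating spatial semantic segmentation of sound scenes (S5), proposes a new metric, CASA-SDR, which uses permutation-invariant source matching to separate classification errors from separation errors, giving a more interpretable, separation-centric evaluation.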

Comments 5 pages; content+bibliography

Abstract

Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. Evaluating S5 systems with separation and classification metrics individually makes system comparison difficult, whereas existing joint metrics, such as the class-aware signal-to-distortion ratio (CA-SDR), can conflate separation and labeling errors. In particular, CA-SDR relies on predicted class labels for source matching, which may obscure label swaps or misclassifications when the underlying source estimates remain perceptually correct. In this work, we introduce the class and source-aware signal-to-distortion ratio (CASA-SDR), a new metric that performs permutation-invariant source matching before computing classification errors, thereby shifting from a classification-focused approach to a separation-focused approach. We first analyze CA-SDR in controlled scenarios with oracle separation and synthetic classification errors, as well as under controlled cross-contamination between sources, and compare its behavior to that of the classical SDR and CASA-SDR. We also study the impact of classification errors on the metrics by introducing error-based and source-based aggregation strategies. Finally, we compare CA-SDR and CASA-SDR on systems submitted to Task 4 of the DCASE 2025 challenge, highlighting the cases where CA-SDR over-penalizes label swaps or poorly separated sources, while CASA-SDR provides a more interpretable separation-centric assessment of S5 performance.
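Illustrative sketch (not the CASA-SDR definition): the ordering the abstract describes, matching estimated sources to references by signal quality first and only then comparing class labels, can be shown with a brute-force permutation search. The plain SDR formula, the mean-SDR matching criterion, the function names, and the toy signals are assumptions made for this example; how the actual metric penalises label errors is defined in the paper.

```python
import itertools
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB."""
    distortion = reference - estimate
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )

def match_sources(references, estimates):
    """Permutation-invariant matching: the assignment of estimates to
    references that maximises mean SDR. perm[i] is the estimate index
    paired with reference i."""
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(len(references))):
        score = np.mean(
            [sdr_db(references[i], estimates[p]) for i, p in enumerate(perm)]
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

def label_errors(ref_labels, est_labels, perm):
    """Count label mismatches after sources have been matched."""
    return sum(ref_labels[i] != est_labels[p] for i, p in enumerate(perm))

# Toy scene: two sources, estimates returned in swapped order.
t = np.arange(800) / 8000.0
references = [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1000 * t)]
estimates = [references[1] + 0.01, references[0] - 0.01]
ref_labels = ["speech", "alarm"]

perm, mean_sdr = match_sources(references, estimates)
errors_correct_labels = label_errors(ref_labels, ["alarm", "speech"], perm)
errors_swapped_labels = label_errors(ref_labels, ["speech", "alarm"], perm)
```

In this toy case the matching and the mean SDR are identical whether or not the predicted labels are swapped, so separation quality and labeling mistakes are reported separately instead of being conflated.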

7. Security, Privacy, and Deepfake Audio 5 papers

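2606.14466 2026-06-15 cs.SD cs.AI cs.LG New submission

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski

Affiliations University of Warsaw

AI summary Proposes a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from classifications, showing that explanation heatmaps in audio deepfake detection can be systematically distorted while the predicted label stays unchanged.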

Comments Accepted to the ICML 2026 Workshop on Machine Learning for Audio: 5 pages, 4 figures

Abstract

This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: https://github.com/cncPomper/Audio-XAI

2606.14639 2026-06-15 cs.SD cs.AI 新提交

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

从自监督语音模型到混合专家系统以实现鲁棒的防欺骗

Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

发表机构 * Université d'Avignon(阿维尼翁大学) Airbus Defence & Space(空中客车防务与航天公司)

AI总结 将自监督语音模型转换为混合专家架构,通过层间门控机制增强泛化能力,在14个欺骗数据集上将宏EER从5.46%降至4.81%。

Comments 8 pages, 3 figures, accepted at Odyssey 2026 (The Speaker and Language Recognition Workshop)

详情
AI中文摘要

近期语音生成的进展显著提升了合成语音的自然度,使得欺骗检测日益困难。当前防欺骗系统的一个关键局限是对未见合成方法的鲁棒性不足。在这项工作中,我们将自监督语音表示模型转换为混合专家(MoE)架构以提高泛化能力。选定编码器层中的前馈块被替换为由层间门控机制控制的多个专家网络,使专家能够捕获互补的声学模式,同时保留自监督预训练期间学习到的表示。我们进一步分析了影响MoE转换性能的架构选择,并研究了专家的激活行为。所提出的方法在14个欺骗数据集上进行了评估,将宏EER从5.46%降至4.81%,相对基线提升了11.9%。

英文摘要

Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
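As a quick check of the figures above, the 11.9% relative improvement follows directly from the two macro EER values quoted in the abstract. The helper name `relative_reduction` is illustrative and not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of an error metric relative to its baseline value."""
    return (baseline - improved) / baseline

# Macro EER values (in percent) quoted in the abstract.
macro_eer_baseline = 5.46
macro_eer_moe = 4.81

relative_gain_pct = 100.0 * relative_reduction(macro_eer_baseline, macro_eer_moe)
print(f"{relative_gain_pct:.1f}% relative macro EER reduction")
```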

2510.17633 2026-06-15 cs.SD cs.CR 版本更新

SARSteer: Safeguarding Large Audio-Language Models via Safe-Ablated Refusal Steering

SARSteer: 通过安全消融拒绝引导保护大型音频语言模型

Weilin Lin, Jianze Li, Hui Xiong, Li Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SARSteer,首个针对大型音频语言模型的推理时防御框架,通过文本衍生的拒绝引导和分解安全空间消融,有效提升有害查询拒绝率并减少良性查询的过度拒绝。

详情
AI中文摘要

大型音频语言模型(LALMs)正成为现实世界应用中强大的多模态骨干。然而,最近的研究表明,音频输入比文本更容易引发有害响应,给部署带来新的风险。尽管安全对齐在LLMs和大型视觉语言模型(LVLMs)中取得了初步进展,我们发现将这些方法直接适配到LALMs面临两个关键限制:1)基于LLM的引导在音频输入下失败,因为激活之间存在较大的分布差距;2)基于提示的防御会导致对良性语音查询的过度拒绝。为了解决这些挑战,我们提出了安全消融拒绝引导(SARSteer),这是首个针对LALMs的推理时防御框架。具体来说,SARSteer利用文本衍生的拒绝引导来强制执行拒绝而不操纵音频输入,并引入分解安全空间消融以减轻过度拒绝。大量实验表明,SARSteer显著提高了有害查询的拒绝率,同时保留了良性响应,朝着LALMs的安全对齐迈出了有原则的一步。代码和构建的数据集已发布于 https://github.com/linweiii/SARSteer 。

英文摘要

Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications. However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. While safety alignment has made initial advances in LLMs and Large Vision-Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between activations, and 2) prompt-based defenses induce over-refusals on benign-speech queries. To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs. Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs. The codes and constructed datasets are released at https://github.com/linweiii/SARSteer.

2602.05670 2026-06-15 cs.SD cs.AI eess.AS 版本更新

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

HyperPotter: 在音频深度伪造检测中施展高阶交互的魔力

Qing Wen, Haohao Li, Zhongjie Ba, Peng Cheng, Miao He, Li Lu, Kui Ren

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于超图的HyperPotter框架,通过聚类超边和类感知原型初始化捕获高阶交互,在13个测试集上平均相对EER降低12.68%。

Comments 20 pages, 8 figures, accepted to ICML 2026

详情
AI中文摘要

AIGC技术的进步使得合成高度逼真的音频深度伪造成为可能,能够欺骗人类听觉感知。尽管已经开发了许多音频深度伪造检测(ADD)方法,但大多数依赖于局部时间/频谱特征或成对关系,忽略了高阶交互(HOIs)。HOIs捕获从多个特征组件中涌现出的判别性模式,超越了它们各自的贡献。我们提出了HyperPotter,一个基于超图的框架,旨在通过基于聚类的超边和类感知原型初始化来捕获与协同模式相关的高阶关系。在13个测试集上的大量实验表明,HyperPotter在11个测试集上优于基线,在所有测试集上平均相对EER降低了12.68%,在改进的测试集上降低了22.15%。这些结果展示了强大的跨场景泛化能力,同时也揭示了在严重编解码器或信道失真下的鲁棒性限制。

英文摘要

Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework designed to capture high-order relations associated with synergistic patterns through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments on 13 test sets show that HyperPotter improves over the baseline on 11 sets, yielding an average relative EER reduction of 12.68% across all test sets and 22.15% on the improved sets. These results demonstrate strong cross-scenario generalization, while also revealing robustness limits under severe codec or channel distortion.

2606.08663 2026-06-15 cs.SD eess.AS 版本更新

Probing Token Spaces under Generator Shift in AI-Generated Music Detection

在AI生成音乐检测中生成器偏移下的Token空间探测

Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito

发表机构 * KAIST(韩国科学技术院)

AI总结 针对AI音乐检测器在生成器偏移下性能下降的问题,提出CoMoE紧凑分类器比较不同音频Token空间,发现编解码器风格离散Token空间应作为主要实验变量。

Comments Accepted to ICML 2026 ML4Audio workshop

详情
AI中文摘要

AI生成的音乐检测器在标准基准分割上可能表现鲁棒,但其部署需要转移到训练期间不存在的生成器源。我们通过源受限评估在MoM-open上研究此问题,这是MoM-CLAM的开放重建,用FMA和MTG-Jamendo替换了不可再分发的真实语料库,同时保留了假生成器协议。为了隔离表示的作用,我们引入了CoMoE,一个紧凑的固定分类器,用于比较异构音频Token空间,同时保持下游架构和训练方案不变。实验表明,标准和真实源受限分割几乎饱和,而假源受限暴露了Token空间之间的巨大差异:X-Codec Token在仅使用Udio训练时最强,而MERT派生的Token在仅使用Suno-v3.5训练时更强。这些结果表明,在AI生成音乐检测中,编解码器风格的离散Token空间应被视为生成器偏移下的主要实验轴。我们的代码和数据可在 https://github.com/MAAP-LAB/CoMoE 获取。

英文摘要

AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

8. 其他/综合语音音频 3 篇

2606.14086 2026-06-15 cs.SD 新提交

Explainable and Trustworthy Speech Emotion Recognition Using Confidence Score and Reinforcement Learning Rectified Speech Emotion Descriptors

使用置信度分数和强化学习修正语音情感描述子的可解释且可信的语音情感识别

Youjun Chen, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Shujie Hu, Huimeng Wang, Haoning Xu, Chengxi Deng, Bowen Zhang, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) National Research Council Canada(加拿大国家研究委员会) Tsinghua University(清华大学)

AI总结 提出基于置信度分数和强化学习的在线语音情感描述子修正方法,用于后训练语音情感识别系统,在IEMOCAP和MELD上分别取得2.9%和3.3%的绝对性能提升。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

可解释且可信的语音情感识别(SER)至今仍是一项具有挑战性的任务,这主要是由于缺乏带有可靠语音情感描述子(SED)标签(如韵律特征和说话人特征)的SER数据。本文提出了一种基于置信度分数和强化学习(RL)的在线SED修正方法,用于在自动标注的SED标签上对SER系统进行后训练。在IEMOCAP和MELD上的实验表明,结合所提出的置信度分数和基于RL的SED修正方法的可解释SER系统,在性能上始终优于没有数据选择或SED修正的基线系统。最佳系统集成了这两个组件,在IEMOCAP和MELD基准测试上,相对于没有数据选择和SED修正的基线系统,分别取得了2.9%和3.3%的绝对SER增益(相对增益分别为3.7%和5.4%)。

英文摘要

Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and reinforcement learning (RL) based on-the-fly SED rectification approach for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification approach consistently outperform baselines without data selection or SED rectification. The best performing system, which integrates both components, surpasses the baseline without data selection and SED rectification, achieving SER gains of 2.9% and 3.3% absolute (3.7% and 5.4% relative) on IEMOCAP and MELD benchmarks, respectively.

2606.14120 2026-06-15 eess.SP cs.AI cs.LG cs.SD eess.AS 交叉投稿

FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding

FAConformer:用于听觉注意解码的频率感知卷积Transformer

Ziwei Wang, Xingyi He, Tianwang Jia, Hongbin Wang, Dongrui Wu

发表机构 * Hubei Key Laboratory of Brain-inspired Intelligent Systems, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology(湖北脑启发智能系统重点实验室,人工智能与自动化学院,华中科技大学)

AI总结 提出FAConformer框架,通过频带特定编码和自适应跨频带交互,有效利用脑电图频域信息进行听觉注意解码,在公开数据集上超越现有最佳模型4.9%。

Comments 15 pages, 7 figures

详情
AI中文摘要

听觉注意解码(AAD)旨在从多说话人声学环境中的神经反应推断被注意的说话人,是神经导向听力系统的关键问题。尽管最近的研究取得了令人鼓舞的进展,但现有的AAD模型仍未充分利用频域脑电图(EEG)信息。特别是,大多数方法通过手工特征提取或直接跨频带特征拼接引入多频带信息,这主要是在浅层利用频率信息,可能忽略频带特定模式和跨频带交互。为了解决这些局限性,本文提出了FAConformer,一种用于AAD的频率感知CNN-Transformer框架,它明确集成了频带特定编码和自适应跨频带交互。具体来说,FAConformer首先将EEG信号分解为多个频带,并为每个频带分配一个独立的CNN-Transformer编码器进行频带特定建模。然后,通过精心设计的频率感知注意(FAA)模块自适应地融合得到的频带特征,该模块通过将频带特征视为令牌来建模跨频带依赖关系。此外,引入了频带辅助监督(BAS)以防止在联合训练中贡献较弱的分支优化不足。通过这种方式,FAConformer执行频率感知建模,更有效地利用频域信息。在两个公开AAD数据集上使用三种决策窗口长度进行的广泛实验表明,FAConformer始终优于12个竞争基线,比当前最先进模型高出4.9%。对频带重要性、消融和参数敏感性的进一步分析验证了所提出框架的有效性、鲁棒性和可解释性。代码可在 https://github.com/wzwvv/FAConformer 获取。

英文摘要

Auditory attention decoding (AAD) aims to infer the attended speaker from neural responses in multi-speaker acoustic environments and is a key problem for neuro-steered hearing systems. Although recent studies have achieved encouraging progress, existing AAD models still do not fully exploit frequency domain electroencephalography (EEG) information. In particular, most approaches introduce multi-band information through handcrafted feature extraction or direct cross-band feature concatenation, which mainly exploit frequency information at a shallow level and may overlook band-specific patterns and cross-band interactions. To address these limitations, this paper proposes FAConformer, a frequency-aware CNN-Transformer framework for AAD that explicitly integrates band-specific encoding and adaptive cross-band interaction. Specifically, FAConformer first decomposes EEG signals into multiple frequency bands and assigns each band to an independent CNN-Transformer encoder for band-specific modeling. The resulting band-wise features are then adaptively fused by a carefully designed frequency-aware attention (FAA) module that models cross-band dependencies by treating band-wise features as tokens. Further, band-wise auxiliary supervision (BAS) is introduced to prevent weakly contributing branches from being under-optimized during joint training. In this way, FAConformer performs frequency-aware modeling that more effectively exploits frequency domain information. Extensive experiments on two public AAD datasets with three decision-window lengths demonstrated that FAConformer consistently outperformed 12 competitive baselines, surpassing the current state-of-the-art model by 4.9%. Further analyses of band importance, ablation, and parameter sensitivity verify the effectiveness, robustness, and interpretability of the proposed framework. Code is available at https://github.com/wzwvv/FAConformer.
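The abstract describes fusing band-wise features by treating each band as a token and attending across bands. The sketch below shows that general idea only. It assumes a single attention head, random projection matrices, five bands, and mean pooling for the final fusion, none of which are specified in the abstract. The function name `band_token_attention` is hypothetical and this is not the authors' FAA module.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def band_token_attention(band_features, w_q, w_k, w_v):
    """Single-head self-attention where each row of band_features is one band token.

    band_features: array of shape (num_bands, dim)
    Returns the fused vector (dim,) and the (num_bands, num_bands) attention weights.
    """
    q = band_features @ w_q
    k = band_features @ w_k
    v = band_features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product between bands
    weights = softmax(scores, axis=-1)       # each band attends over all bands
    attended = weights @ v
    fused = attended.mean(axis=0)            # assumed fusion: average over bands
    return fused, weights

# Assumed toy sizes: 5 frequency bands, 8-dimensional band features.
rng = np.random.default_rng(0)
num_bands, dim = 5, 8
band_features = rng.standard_normal((num_bands, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
fused, weights = band_token_attention(band_features, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Row `i` of `weights` shows how strongly band `i` draws on every other band, which is the kind of cross-band dependency the abstract refers to.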

2606.14662 2026-06-15 cs.LG cs.SD 交叉投稿

Beyond task performance: Decoding bioacoustic embeddings with speech features

超越任务性能:用语音特征解码生物声学嵌入

Ines Nolasco, Jules Cauzinille, Marius Miron, Gagan Narula, Milad Alizadeh, Emmanuel Fernandez, Matthieu Geist, Ellen Gilsenan-McMahon, Olivier Pietquin, Emmanuel Chemla, Sara Keen

发表机构 * Earth Species Project(地球物种项目)

AI总结 本研究通过线性与非线性回归探针,揭示生物声学预训练嵌入编码的语音特征,发现不同模型互补覆盖声学空间,并提出基于特征可恢复性的模型选择指南。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

预训练音频嵌入在生物声学中是标准做法,但关于这些模型编码了哪些声学特征以及哪些特征对特定任务有用,我们知之甚少。这阻碍了透明度,并限制了向稀有物种或数据稀缺领域的扩展。在这里,我们揭示了生物声学表示中编码了哪些类似语音的特征。使用跨越六个分类群的88个eGeMAPS特征,我们应用线性和非线性回归探针来量化每个模型捕获了哪些声学属性。结果证实了“没有免费午餐”的模式:没有单个模型能捕获完整的特征空间。拼接嵌入实现了最高性能,表明模型之间互补的声学空间覆盖。响度特征编码最好($R^2 = 0.76$),而F0最难恢复($R^2 = 0.33$)。通过将可恢复性与每个物种的特征显著性(NMI)交叉引用,我们为生物声学得出了数据驱动的模型选择指南。

英文摘要

Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species or data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups, we apply linear and nonlinear regression probes to quantify which acoustic properties each model captures. Results confirm a "no free lunch" pattern: no single model captures the full feature space. A concatenated embedding achieves the highest performance, suggesting complementary acoustic space coverage across models. Loudness features are best encoded ($R^2 = 0.76$) while F0 is hardest to recover ($R^2 = 0.33$). By cross-referencing recoverability with per-species feature salience (NMI), we derive data-driven model selection guidance for bioacoustics.
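A linear regression probe of the kind mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a ridge-regularized least-squares probe scored by held-out $R^2$. The function name `linear_probe_r2`, the ridge strength, and the synthetic embeddings and targets are all assumptions, and no eGeMAPS features or real model embeddings are involved.

```python
import numpy as np

def linear_probe_r2(train_x, train_y, test_x, test_y, ridge=1e-3):
    """Fit a ridge-regularized linear probe and return R^2 on held-out data."""
    mean_x = train_x.mean(axis=0)
    mean_y = train_y.mean()
    centered_x = train_x - mean_x
    dim = centered_x.shape[1]
    # Closed-form ridge solution: (X^T X + ridge * I) w = X^T y
    weights = np.linalg.solve(
        centered_x.T @ centered_x + ridge * np.eye(dim),
        centered_x.T @ (train_y - mean_y),
    )
    predictions = (test_x - mean_x) @ weights + mean_y
    ss_res = np.sum((test_y - predictions) ** 2)
    ss_tot = np.sum((test_y - test_y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins: one feature that is linearly encoded and one that is not.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((400, 16))
true_weights = rng.standard_normal(16)
encoded_feature = embeddings @ true_weights + 0.1 * rng.standard_normal(400)
unrelated_feature = rng.standard_normal(400)

train, test = slice(0, 300), slice(300, 400)
r2_encoded = linear_probe_r2(embeddings[train], encoded_feature[train],
                             embeddings[test], encoded_feature[test])
r2_unrelated = linear_probe_r2(embeddings[train], unrelated_feature[train],
                               embeddings[test], unrelated_feature[test])
print(f"encoded: {r2_encoded:.2f}, unrelated: {r2_unrelated:.2f}")
```

A high held-out $R^2$ indicates that the target is linearly recoverable from the embedding, while a value near or below zero indicates that it is not.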