arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13651 2026-05-14 cs.SD cs.AI Updated version

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

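Affiliations: WAVES Research Group, Ghent University, Gent, Belgium; AI Lab, Vrije Universiteit Brussel, Brussel, Belgium; EECS, Queen Mary University of London, London, UK

AI summary: This paper proposes NAACA, a training-free neuroauditory attentive cognitive architecture that addresses the attention bottleneck in detecting salient events in long-form audio. At its core is a neuro-inspired Oscillatory Working Memory (OWM) that triggers higher-level language-model processing only on perceptual salience, improving event-detection precision and reducing unnecessary computation. Experiments show that NAACA substantially improves detection performance on the XD-Violence dataset and is robust to noise and transient pauses on an urban soundscape dataset.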

Comments Accepted as a regular paper by ICML 2026

Abstract

Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
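
Illustrative sketch (not the paper's OWM): the idea of invoking an expensive model only when energy departs from an adaptive baseline can be pictured with a generic running-statistics gate. The function name `salience_gate`, its parameters, and the per-frame energy input are assumptions made for this example.

```python
import math


def salience_gate(energies, alpha=0.1, k=3.0, warmup=5):
    """Return indices of frames whose energy deviates strongly from a running baseline.

    energies: per-frame energy values (hypothetical input).
    alpha:    smoothing factor of the exponential moving statistics.
    k:        deviation threshold, in running standard deviations.
    warmup:   number of initial frames used only to initialise the baseline.
    """
    mean = energies[0]
    var = 0.0
    triggered = []
    for i, e in enumerate(energies[1:], start=1):
        std = math.sqrt(var)
        # Gate: flag the frame only if it departs from the adaptive baseline.
        if i >= warmup and abs(e - mean) > k * std + 1e-9:
            triggered.append(i)
        # Update the exponentially weighted mean and variance after the decision.
        diff = e - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return triggered


# A flat background with a single burst at frame 20.
frames = [1.0] * 20 + [10.0] + [1.0] * 20
flagged = salience_gate(frames)
```

With this input only the burst frame is flagged, so a downstream model would be called once instead of 41 times. The baseline adapts after the burst, which keeps the return to background from being flagged again.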

2605.13431 2026-05-14 cs.SD Updated version

Text2Score: Generating Sheet Music From Textual Prompts

Keshav Bhandari, Sungkyun Chang, Abhinaba Roy, Francesca Ronchini, Emmanouil Benetos, Dorien Herremans, Simon Colton

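Affiliations: Queen Mary University of London; Singapore University of Technology and Design; Politecnico di Milano; EmotionWave

AI summary: This paper proposes Text2Score, a two-stage framework for generating sheet music from natural language prompts, aimed at the data scarcity and unreliable automated captioning that hamper text-driven symbolic music generation. By extracting supervision signals directly from symbolic XML data, the method bypasses the noise and sparsity of conventional text-music pairs. It consists of a planning stage, in which a large language model produces a structured score plan, and an execution stage, which generates ABC-notation sheet music that follows the plan. Experiments show that Text2Score outperforms existing methods across evaluation dimensions including playability and readability; the dataset, code, and evaluation tools are open-sourced.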

Comments 8 pages including references, 1 figure

Abstract

Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).

2605.13404 2026-05-14 cs.SD Updated version

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris, Konstantinos Tsamis

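Affiliations: Dept. of Music Technology and Acoustics, Hellenic Mediterranean University; Athena RC

AI summary: This study proposes Sec2Drum-DAC, a conditional latent-diffusion model that generates drum audio from symbolic control information. The model samples event features at physical time points and predicts principal-component coordinates of frozen DAC codebook embeddings rather than generating waveform samples directly, producing realistic audio while preserving rhythm and dynamics. Experiments show that the method outperforms deterministic PCA regression and a symbolic rendering baseline on several evaluation metrics, most notably on spectral and transient characteristics.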

Abstract

Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.

2603.02245 2026-05-14 eess.AS cs.LG cs.SD Updated version

LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

Niloofar Jazaeri, Hilmi R. Dajani, Marco Janeczek, Martin Bouchard

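Affiliations: University of Ottawa; Crynostics Inc.

AI summary: This paper studies cross-domain infant cry classification. To address nonstationary signals, limited annotations, and large domain gaps, it proposes a compact acoustic framework that fuses MFCC, STFT, and fundamental-frequency features and models temporal dynamics with an enhanced Legendre Memory Unit (LMU). Calibrated posterior ensemble fusion improves generalization across datasets; experiments show better macro-F1 under cross-domain evaluation and feasibility for real-time deployment.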

Comments 7 pages, to appear in Proc. Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC 2026), Toronto, Canada, July 26-30 2026

Abstract

Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses mel-frequency cepstral coefficients (MFCCs), short-time Fourier transform (STFT) features, and fundamental-frequency (F0) contours within a multi-branch convolutional neural network (CNN) encoder, and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
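
Illustrative sketch: the abstract does not spell out how the entropy-gated weighting is computed, so the following shows one plausible reading, in which each model's class posterior is weighted by `exp(-entropy / temperature)` before averaging. The function names and the `temperature` parameter are assumptions for this example, not the authors' formulation.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def entropy_gated_fusion(posteriors, temperature=1.0):
    """Fuse per-model class posteriors, down-weighting uncertain models.

    posteriors:  list of probability vectors, one per model (hypothetical input).
    temperature: softness of the gate; smaller values favour confident models more.
    """
    # Low entropy (a confident model) gives a large weight.
    raw = [math.exp(-entropy(p) / temperature) for p in posteriors]
    total = sum(raw)
    weights = [w / total for w in raw]
    n_classes = len(posteriors[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, posteriors))
        for c in range(n_classes)
    ]
    return fused, weights


# One confident model and one that is maximally uncertain over three classes.
fused, weights = entropy_gated_fusion([[0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3]])
```

Because the weights sum to one, the fused vector is still a probability distribution, and the confident model dominates the result.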

2602.16253 2026-05-14 eess.AS cs.SD Updated version

How Much Does Machine Identity Matter in Anomalous Sound Detection at Test Time?

Kevin Wilkinghoff, Keisuke Imoto, Zheng-Hua Tan

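Affiliations: Aalborg University; Pioneer Centre for Artificial Intelligence; Kyoto University

AI summary: This paper studies how the absence of machine identity information at test time affects anomalous sound detection (ASD) performance. The authors propose a modified evaluation protocol that merges test recordings from multiple machines and runs inference without machine identity, using identity labels only for post hoc evaluation. Experiments show that this protocol reveals performance degradations and differences in method robustness that are hidden under conventional evaluation, and that the degradations are closely related to the models' implicit machine identification accuracy.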

Abstract

Anomalous sound detection (ASD) benchmarks typically assume that the identity of the monitored machine is known at test time and that recordings are evaluated in a machine-wise manner. However, in realistic monitoring scenarios with multiple known machines operating concurrently, test recordings may not be reliably attributable to a specific machine, and requiring machine identity imposes deployment constraints such as dedicated sensors per machine. To reveal performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, we consider a minimal modification of the ASD evaluation protocol in which test recordings from multiple machines are merged and evaluated jointly without access to machine identity at inference time. Training data and evaluation metrics remain unchanged, and machine identity labels are used only for post hoc evaluation. Experiments with representative ASD methods show that relaxing this assumption reveals performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, and that these degradations are strongly related to implicit machine identification accuracy.
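
Illustrative sketch: one way to see why removing machine identity can matter is to compare AUC computed per machine with AUC computed on pooled scores when the machines' score ranges differ. The scores below are invented for the example and this is not the paper's evaluation code.

```python
def auc(normal_scores, anomalous_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))


# Hypothetical anomaly scores for two machines with different score ranges.
a_normal, a_anomalous = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]
b_normal, b_anomalous = [0.5, 0.6, 0.7], [0.8, 0.9, 1.0]

# Machine-wise evaluation: score each machine separately, then average.
machine_wise_auc = (auc(a_normal, a_anomalous) + auc(b_normal, b_anomalous)) / 2

# Pooled evaluation: one ranking over all recordings, no machine identity used.
merged_auc = auc(a_normal + b_normal, a_anomalous + b_anomalous)
```

Each machine is perfectly separable on its own (machine-wise AUC of 1.0), but machine A's anomalies overlap machine B's normal recordings, so the pooled AUC falls to 29/36, about 0.81.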

2512.20211 2026-05-14 cs.SD eess.AS eess.SP Updated version

Aliasing-Free Neural Audio Synthesis

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela

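Affiliations: Aalto University School of Science; Aalto University; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Spellbrush, Akihabara, Tokyo

AI summary: In neural audio synthesis, existing models fall short on high-quality music and singing voice because non-linear activation functions and upsampling layers introduce severe aliasing artifacts. This paper brings differentiable anti-aliasing techniques into the activation and upsampling modules and proposes Pupu-Vocoder and Pupu-Codec, which improve audio reconstruction quality. Experiments show that the new models outperform existing systems on music, singing voice, and general audio while remaining comparable on speech.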

Comments Accepted by TASLP

Abstract

In neural audio synthesis, neural vocoders and codecs are models that reconstruct waveforms from acoustic and latent representations, which are essential to the resulting audio quality. While current models are capable of generating perceptually natural speech, they still struggle with high-fidelity music and singing voice synthesis, as severe aliasing artifacts are introduced by non-linear activation functions and upsampling layers in existing architectures. Although various anti-aliasing techniques have been proposed in digital signal processing, their integration into neural vocoders and codecs remains under-explored. This paper incorporates differentiable anti-aliasing techniques into the activation and upsampling modules to bridge this gap, and thus presents Pupu-Vocoder and Pupu-Codec. We build a test signal benchmark to evaluate the anti-aliased modules, and validate our proposed models on speech, singing voice, music, and audio. Experimental results show that Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech. Demos, codes, and checkpoints are available at VocodexElysium.github.io/AliasingFreeNeuralAudioSynthesis/.

2502.20427 2026-05-14 cs.CR cs.AI cs.SD eess.AS Updated version

DeePen: Penetration Testing for Audio Deepfake Detection

Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger

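Affiliations: Technical University of Cluj-Napoca; AISRC, Soongsil University

AI summary: This paper proposes DeePen, a systematic penetration testing methodology for assessing the robustness of machine-learning-based audio deepfake detection classifiers. The method requires no knowledge of or access to the target detection model and instead probes for vulnerabilities with a set of carefully designed signal-processing attacks. The study finds that both deployed production systems and public academic models can be fooled by simple manipulations such as time-stretching or echo addition, indicating that current deepfake detection still faces serious challenges.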

Abstract

Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.
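
Illustrative sketch: echo addition, one of the manipulations named above, can be written in a few lines as mixing a delayed, attenuated copy of the signal into itself. The function `add_echo` and its parameter values are assumptions for this example; the abstract does not give the settings used in DeePen.

```python
def add_echo(samples, delay, decay=0.5):
    """Return the signal with one delayed, attenuated copy mixed in.

    samples: list of float audio samples.
    delay:   echo delay in samples (must be positive).
    decay:   gain applied to the delayed copy.
    """
    if delay <= 0:
        raise ValueError("delay must be a positive number of samples")
    # Extend the output so the echo tail is not truncated.
    out = list(samples) + [0.0] * delay
    for i, x in enumerate(samples):
        out[i + delay] += decay * x
    return out


# A unit impulse gains a half-amplitude echo two samples later.
echoed = add_echo([1.0, 0.0, 0.0], delay=2, decay=0.5)
```

The operation needs no access to the detector, which matches the black-box setting described in the abstract.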

2605.13099 2026-05-14 cs.SD Updated version

Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval

Boda Xiao, Bo Wang, Heping Cheng

Affiliations: Center for BioMed-X Research, Academy for Advanced Interdisciplinary Studies, Peking University; Speech and Hearing Research Center, School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Beijing, China; National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University

AI summary: This paper studies how to detect speech from non-invasive brain signals (MEG) and proposes a new method that avoids reconstructing the speech signal directly. A contrastive learning model first retrieves, from a large-scale audio library, the speech segment that matches the test MEG signal; a speech detection model then produces the binary silence/speech sequence from the retrieved audio. The method achieved a top result in the LibriBrain 2025 Speech Detection task, demonstrating the effectiveness of leveraging an external audio database for speech detection.

Comments ranked first at LibriBrain Competition 2025 https://neural-processing-lab.github.io/2025-libribrain-competition/prizes/

Abstract

Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
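
Illustrative sketch: the two-step idea (retrieve the matching audio, then label speech on the retrieved audio rather than on the brain signal) can be outlined with nearest-neighbour search over embeddings followed by a simple detector. The embeddings, segment names, and the energy-threshold detector are stand-ins invented for this example; the paper's contrastive model and speech detection model are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, library):
    """Step 1: return the id of the library segment closest to the query."""
    return max(library, key=lambda seg_id: cosine(query_embedding, library[seg_id]))


def speech_mask(frame_energies, threshold):
    """Step 2 (stand-in detector): 1 for speech frames, 0 for silence."""
    return [1 if e > threshold else 0 for e in frame_energies]


# Hypothetical audio-side embeddings and per-frame energies for two segments.
library = {"seg_a": [1.0, 0.0], "seg_b": [0.0, 1.0]}
energies = {"seg_a": [0.0, 0.5, 0.8, 0.01], "seg_b": [0.3, 0.3, 0.0, 0.0]}

best = retrieve([0.9, 0.1], library)          # embedding of the MEG query
mask = speech_mask(energies[best], threshold=0.1)
```

The labels come entirely from the retrieved audio, so their quality depends on retrieval accuracy rather than on decoding speech timing from MEG directly.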

2603.05094 2026-05-14 cs.SD Updated version

TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee

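Affiliations: National Taiwan University; Shanghai Jiao Tong University

AI summary: This paper presents TW-Sound580K, a Taiwanese audio-text instruction dataset built with a Verify-Generate-Critique (VGC) protocol, addressing the poor performance of large audio-language models on localized dialectal prosody caused by the lack of specialized corpora. Dual-ASR validation is used to filter 522,000 raw audio clips, which are then expanded into 580,000 high-quality instruction pairs. Tai-LALM, trained on this dataset, reaches 49.1% accuracy on the TAU Benchmark, 6.5 percentage points above the zero-shot baseline, showing that combining regional corpora with rigorous curation and dynamic arbitration improves performance on localized speech tasks.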

Abstract

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.

2601.22792 2026-05-14 eess.AS cs.CL cs.SD Updated version

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

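Affiliations: Honda Research Institute Japan Co., Ltd.; Carnegie Mellon University

AI summary: This paper proposes CALM, a joint contextual acoustic-linguistic modeling framework for personalizing multi-speaker automatic speech recognition (ASR). It jointly models acoustic and linguistic cues through speaker-embedding-driven target-speaker extraction and dynamic-vocabulary-based contextual biasing. Experiments show that CALM substantially reduces biased error rates on English and Japanese speech-mixture datasets, demonstrating its effectiveness across languages.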

Comments Accepted to IEEE ICASSP 2026

Abstract

We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
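
Worked arithmetic: the reported absolute error rates translate into relative reductions of roughly 63% for B-WER and 49% for B-CER. The helper below is only a convenience for this calculation; the input numbers are the ones quoted in the abstract.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction, in percent."""
    return 100.0 * (before - after) / before


b_wer_gain = relative_reduction(12.7, 4.7)   # LibriSpeech2Mix, B-WER
b_cer_gain = relative_reduction(16.6, 8.4)   # CSJMix2 (eval3), B-CER
```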

2508.20474 2026-05-14 eess.AS cs.CL cs.SD Updated version

Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe

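Affiliations: Honda Research Institute Japan, Japan; Carnegie Mellon University, USA

AI summary: This paper proposes a unified multi-speaker encoder (UME) that uses a shared speech foundation encoder to jointly learn representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR). A residual weighted-sum encoding (RWSE) of UME's hidden representations from multiple layers fuses information from different semantic levels and strengthens alignment and synergy between tasks. Experiments show that UME substantially outperforms separately trained baselines on LibriMix; for SD in particular it achieves diarization error rates of 1.37% and 2.29%, better than previously reported results.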

Comments Accepted to IEEE ASRU 2025

Abstract

This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.

2411.15913 2026-05-14 cs.SD cs.AI cs.LG eess.AS Updated version

Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Jungwoo Seo, Shinjae Yoo, Yuewei Lin, Jiook Cha

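Affiliations: Seoul National University; Michigan State University; Rutgers University; Brookhaven National Laboratory

AI summary: This study proposes Stylus, a training-free music style transfer method that reuses pretrained image diffusion models to perform style transfer in the Mel-spectrogram domain. Treating audio as structured time-frequency images, it manipulates self-attention by injecting style keys and values while keeping the source's structural queries, transferring style while preserving content structure. Experiments show that Stylus outperforms existing methods in both content preservation and perceptual quality, validating that generic image priors are effective for training-free transformation of structured Mel-spectrograms.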

Comments Accepted by ICIP 2026

Abstract

Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.
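
Illustrative sketch: injecting style keys and values while keeping source queries amounts to running scaled dot-product attention with the query taken from the source and the key/value pair taken from the style reference. The blending factor `gamma` below is an assumed stand-in for the adjustable stylization control, whose exact form the abstract does not give; all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def stylized_attention(src_q, src_k, src_v, sty_k, sty_v, gamma=1.0):
    """Blend source self-attention with attention over injected style K/V.

    gamma=0 keeps the plain source output; gamma=1 uses only style keys/values.
    """
    content = attention(src_q, src_k, src_v)
    styled = attention(src_q, sty_k, sty_v)  # source queries, style keys/values
    return [[(1 - gamma) * c + gamma * s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(content, styled)]


src_q, src_k, src_v = [[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 2.0]]
sty_k, sty_v = [[0.0, 1.0]], [[5.0, 6.0]]
plain = stylized_attention(src_q, src_k, src_v, sty_k, sty_v, gamma=0.0)
full = stylized_attention(src_q, src_k, src_v, sty_k, sty_v, gamma=1.0)
```

Because the queries still come from the source, the attention pattern stays organised around the source layout while the values it aggregates carry the style content.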