arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10494 2026-05-12 cs.SD cs.AI Version update

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

Marius Miron, David Robinson, Masato Hagiwara, Titouan Parcollet, Jules Cauzinille, Gagan Narula, Milad Alizadeh, Ellen Gilsenan-McMahon, Sara Keen, Emmanuel Chemla, Benjamin Hoffman, Maddie Cusimano, Diane Kim, Felix Effenberger, Jane K. Lawton, Aza Raskin, Olivier Pietquin, Matthieu Geist

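Affiliations: Earth Species Project

AI summary: This paper studies how different probing strategies affect the transfer of audio representations to bioacoustic tasks, and proposes multi-layer attentive probes that make better use of temporal information and improve downstream performance. It compares linear and attention probes on two bioacoustic benchmarks, BEANs and BirdSet, and finds that multi-layer probes outperform conventional single-layer probes, and that for Transformer models attention probes clearly outperform linear probes. The work offers new methods and insights for evaluating and improving the transferability of audio representations.

Abstract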

Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
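The contrast between a last-layer linear probe and a multi-layer attentive probe can be sketched as a forward pass in NumPy. This is a minimal illustration only: the softmax layer mixing, the single learned query, and all shapes and names (`layer_logits`, `query`, `W`, `b`) are assumptions made here, not the probe design used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_last_layer_probe(layers, W, b):
    """Mean-pool the last encoder layer over time, then apply one linear layer."""
    pooled = layers[-1].mean(axis=0)                    # (D,)
    return pooled @ W + b                               # (C,)

def multilayer_attentive_probe(layers, layer_logits, query, W, b):
    """Softmax-weighted mix of layers, attention pooling over time, linear head."""
    stacked = np.stack(layers)                          # (L, T, D)
    mix = softmax(layer_logits)                         # (L,) learned layer weights
    fused = np.tensordot(mix, stacked, axes=1)          # (T, D)
    scores = fused @ query / np.sqrt(fused.shape[1])    # (T,) one score per frame
    attn = softmax(scores)                              # (T,) weights over time
    pooled = attn @ fused                               # (D,)
    return pooled @ W + b, attn

# Toy inputs: 4 encoder layers, 50 frames, 16 features, 3 classes (all hypothetical).
rng = np.random.default_rng(0)
L, T, D, C = 4, 50, 16, 3
layers = [rng.normal(size=(T, D)) for _ in range(L)]
W, b = rng.normal(size=(D, C)), np.zeros(C)

logits_linear = linear_last_layer_probe(layers, W, b)
logits_attn, attn = multilayer_attentive_probe(
    layers, np.zeros(L), rng.normal(size=D), W, b
)
```

In this sketch the attentive probe can weight informative frames and layers differently, whereas the linear probe sees only a time average of the final layer.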

2605.10281 2026-05-12 cs.SD cs.AI Version update

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Tsamis

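Affiliations: Dept. of Music Technology and Acoustics, Hellenic Mediterranean University

AI summary: This paper studies how to generate realistic drum audio directly from expressive drum grids (a MIDI representation) that carry microtiming and velocity information, and proposes a method based on neural audio codecs. A Transformer-based model maps the drum grid to a sequence of discrete codec tokens, and a pre-trained codec decoder converts them to waveform audio. Experiments on E-GMD, a large dataset of human drum performances, show good audio fidelity and musical alignment, establishing an effective route for drum grid-to-audio generation and offering practical guidance on choosing audio tokenizers for percussive synthesis.

Abstract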

Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.

2605.10256 2026-05-12 cs.SD cs.AI Version update

A Cold Diffusion Approach for Percussive Dereverberation

Dimos Makris, András Barják, Maximos Kaliakatsos-Papakostas

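Affiliations: Department of Music Technology and Acoustics, Hellenic Mediterranean University

AI summary: This paper proposes a cold diffusion framework for percussive dereverberation, addressing the fact that current dereverberation research focuses mainly on speech and neglects percussive signals. Reverberation is modeled as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. The study introduces two reverse-process parameterizations and uses a UNet and a diffusion Transformer as backbones, trained and evaluated on datasets of acoustic and electronic drum recordings. Experiments show that the method outperforms existing score-based and conditional diffusion baselines on multiple metrics.

Comments: Accepted for the 2026 IEEE World Congress on Computational Intelligence, IJCNN Track, 21-26 June 2026, Maastricht, the Netherlands

Abstract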

Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.
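The idea of reversing a deterministic degradation can be illustrated with the generic cold diffusion sampling update, in which a dry-signal estimate is re-degraded to two adjacent steps. The sketch below is an assumption-laden toy: the linear blend in `degrade`, the step count, the synthetic impulse response, and the `oracle` stand-in for a trained network are all hypothetical, and the update shown is the standard estimate-based one rather than the Direct or Delta-normalized parameterizations studied in the paper.

```python
import numpy as np

def degrade(x0, y, t, T):
    """Deterministic degradation: blend the dry estimate x0 toward the reverberant y."""
    alpha = t / T
    return (1.0 - alpha) * x0 + alpha * y

def cold_diffusion_sample(y, restore, T):
    """Walk from the reverberant signal y (step T) back to a dry estimate (step 0)."""
    x = y.copy()
    for t in range(T, 0, -1):
        x0_hat = restore(x, t)  # dry estimate from the (hypothetical) network
        # Swap the degradation at step t for the milder one at step t - 1.
        x = x - degrade(x0_hat, y, t, T) + degrade(x0_hat, y, t - 1, T)
    return x

# Toy data: a random "dry" signal and a synthetic decaying impulse response.
rng = np.random.default_rng(0)
dry = rng.normal(size=256)
rir = np.exp(-np.arange(64) / 8.0) * rng.normal(size=64)
rir[0] = 1.0  # direct path
wet = np.convolve(dry, rir)[: dry.size]

# Oracle restoration stands in for a trained model, purely to check the sampler.
oracle = lambda x, t: dry
recovered = cold_diffusion_sample(wet, oracle, T=10)
```

With a perfect restoration function the sampler returns the dry signal exactly, which is a convenient sanity check before plugging in a learned model.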

2605.10203 2026-05-12 cs.SD eess.AS Version update

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

Haowen Li, Tianxiang Li, Yi Yang, Boyu Cao, Qi Liu

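Affiliations: School of Future Technology, South China University of Technology, Guangzhou, China

AI summary: This work proposes Polyphonia, a zero-shot timbre transfer framework that addresses the problem that editing the timbre of a specific stem in polyphonic music tends to damage the background accompaniment. Its core is an Acoustic-Informed Attention Calibration mechanism that uses a probabilistic acoustic prior to establish coarse boundaries, so the target stem can be localized and modified more precisely while non-target stems are preserved. Experiments show a 15.5% improvement in target alignment over existing methods, with competitive music fidelity and non-target integrity.

Comments: Accepted by ICML 2026

Abstract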

The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production necessitates precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To resolve this dilemma, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while preserving non-target stems. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Specifically, Polyphonia achieves an increase of 15.5% in target alignment compared to baselines, while maintaining competitive music fidelity and non-target integrity.

2605.10153 2026-05-12 cs.SD cs.LG Version update

APEX: Audio Prototype EXplanations for Classification Tasks

Piotr Kawa, Kornel Howil, Piotr Borycki, Miłosz Adamczyk, Przemysław Spurek, Piotr Syga

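Affiliations: Department of Artificial Intelligence, Wroclaw University of Science and Technology, Poland; Resemble AI, USA; IDEAS Research Institute, Poland; Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Poland

AI summary: This paper proposes APEX, an explanation framework for audio classification, addressing the shortage of explainable AI methods for the audio domain. It operates on pre-trained audio classifiers, needs no fine-tuning, and produces explanations consistent with the original model's outputs. APEX decomposes explanations into four perspectives (Square-based, Time-based, Frequency-based, and Time-Frequency-based), giving intuitive explanations suited to acoustic signals and improving the semantic clarity of classification results.

Abstract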

Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

2602.10666 2026-05-12 eess.AS cs.LG cs.SD Version update

From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

Riccardo Miccini, Clément Laroche, Tobias Piechowiak, Xenofon Fafoutis, Luca Pezzarossa

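Affiliations: GN Hearing; Technical University of Denmark (DTU)

AI summary: This paper studies how the internal pruning masks produced by Dynamic Channel Pruning (DynCP) in speech enhancement networks can be used to estimate auxiliary signal properties, such as voice activity detection (VAD), noise classification, and fundamental frequency (F0) estimation, avoiding the need to deploy extra models. Simple, interpretable predictors achieve high accuracy on several tasks with negligible computational overhead. The study both reveals what DynCP models learn, as seen through downstream tasks, and proposes DynCP as a unified solution for efficient speech enhancement and joint estimation of signal properties.

Comments: Accepted for publication at the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract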

Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an $R^2$ of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
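The remark that predictions on binary masks reduce to weighted sums can be made concrete with a linear predictor: because each mask entry is 0 or 1, the dot product is just the sum of the weights of the channels left active. The mask size, weights, bias, and the zero threshold for a VAD-style decision below are hypothetical values chosen for illustration, not taken from the paper.

```python
import numpy as np

def predict_from_mask(mask, weights, bias):
    """Linear score on a binary mask: bias plus the weights of active channels."""
    active = np.flatnonzero(mask)        # indices of channels that were kept
    return bias + weights[active].sum()  # additions only, no multiplications

# Toy example: one pruning mask over 64 channels and made-up predictor weights.
rng = np.random.default_rng(0)
mask = rng.integers(0, 2, size=64)
weights = rng.normal(size=64)
bias = -0.1

score = predict_from_mask(mask, weights, bias)
dense = bias + mask @ weights            # same value via an ordinary dot product
is_speech = score > 0.0                  # hypothetical decision threshold
```

The two scores agree, which shows why such a predictor adds almost no computation on top of a network that already produces the masks.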

2509.08031 2026-05-12 cs.SD cs.AI cs.LG eess.AS Version update

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar

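Affiliations: ServiceNow; University of Texas at Austin

AI summary: Large Audio Language Models (LALMs) are advancing rapidly, but their evaluation toolkits remain inefficient and non-standardized, which limits fair comparison and systematic assessment. This paper proposes AU-Harness, an efficient and comprehensive evaluation framework that achieves a speedup of up to 151% over existing toolkits through optimized batch processing and parallel execution, and provides standardized prompting protocols and flexible configurations. It supports multi-turn dialogue analysis, helps reveal the true audio reasoning capabilities of LALMs, and advances systematic model development.

Abstract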

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipelines that bottleneck large-scale studies; (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of a unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

2507.23511 2026-05-12 eess.AS cs.AI cs.CL cs.SD Version update

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan

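Affiliations: The Chinese University of Hong Kong, Hong Kong, China

AI summary: This paper proposes MECAT, a multi-expert constructed benchmark for fine-grained audio understanding, aimed at the shortcomings of current audio-language models in nuanced comprehension. The benchmark combines analysis from specialized expert models with Chain-of-Thought large language model reasoning to generate multi-perspective, fine-grained captions and open-set question-answering pairs, and introduces a new metric, DATE, that better distinguishes how detailed model outputs are. Experiments show that MECAT gives a more accurate picture of the capabilities and limitations of existing audio models on fine-grained understanding tasks.

Comments: Accepted to ICML 2026

Abstract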

While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

2605.10084 2026-05-12 eess.AS cs.AI cs.LG cs.SD Version update

PoDAR: Power-Disentangled Audio Representation for Generative Modeling

Alejandro Luebs, Mithilesh Vaidya, Ishaan Kumar, Sumukh Badam, Stephen W. Bailey, Matthew Bendel, Jose Sotelo, Xingzhe He

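Affiliations: Descript

AI summary: This paper proposes PoDAR, an audio representation that explicitly disentangles signal power from semantic content, markedly improving the modelability of the audio latent space. It uses randomized power augmentation and a latent consistency objective, which speeds up the convergence of generative models and improves generation quality. Experiments show that PoDAR outperforms the baseline on several metrics and extends the stable guidance regime for conditional generation.

Comments: 9 pages, 3 figures

Abstract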

The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a $2\times$ acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.

2605.09908 2026-05-12 cs.LG cs.AI cs.SD Version update

Voice Biomarkers for Depression and Anxiety

Oleksii Abramenko, Noah D. Stein, Colin Vaz

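Affiliations: Kintsugi Mindful Wellness, Inc.

AI summary: This paper studies how to detect depression and anxiety from speech, and proposes a deep learning approach that models raw speech signals directly, avoiding the reliance of traditional methods on hand-engineered features. The model is trained on a large dataset of about 65,000 utterances from more than 23,000 subjects representative of United States demographics. It extracts content-agnostic biomarker information which, combined with lexical features from the audio, improves predictive performance in production settings. Evaluated on about 5,000 unique subjects, the model achieves 71% sensitivity and specificity, and the best-performing model has been released to support further research.

Abstract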

Current approaches to detecting depression and anxiety from speech primarily rely on machine learning techniques that utilize hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of speech signals. Applying deep learning methods directly to raw speech signals has the potential to produce biomarker representations with substantially greater predictive power. However, these approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations of the underlying biomarkers. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset comprising ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results demonstrate that the proposed models can extract content-agnostic biomarker information, which, when combined with lexical features extracted from audio, yields improved predictive performance in production settings. Our models are evaluated on ~5000 unique subjects and achieve performance of 71% in terms of sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.
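For readers less familiar with the reported metrics, sensitivity is the fraction of true positive cases that the model flags, and specificity is the fraction of true negative cases that it clears. A minimal sketch follows; the labels and predictions are invented for illustration and have no connection to the paper's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged
    return tp / (tp + fn), tn / (tn + fp)

# Invented example: 4 positive and 4 negative subjects.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Reporting both numbers matters in screening settings, because a model can trade one for the other simply by moving its decision threshold.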

2605.09906 2026-05-12 cs.AI cs.SD Version update

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang

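Affiliations: Tianjin Key Laboratory of Cognitive Computing; Tianjin University; Huiyan Technology Company, Ltd.; Chinese Academy of Sciences; Tencent

AI summary: This work targets cross-modal interference in the reasoning of audio-visual large language models and proposes a reasoning framework called Separate First, Fuse Later (SFFL). The method enforces modality-specific reasoning, producing separate audio and visual reasoning traces and integrating the evidence at a later stage to answer, which reduces interference between modalities. Experiments show consistent gains in accuracy and robustness across multiple benchmarks.

Abstract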

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.

2605.09846 2026-05-12 cs.SD cs.AI Version update

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

Yakun Liu, Hai Luan, Dong Liu, Zhiyu Jin

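Affiliations: Department of Composition; Education Information Center; Department of Musicology

AI summary: In new media art creation, the mapping between vision and hearing is often subjective. This paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. It builds a dataset based on Kirchhoff-Love plate theory, uses a lightweight CNN with a CBAM module for high-precision, low-latency pattern classification, and implements an end-to-end system in Python and Max/MSP that maps recognized patterns to the corresponding sine wave frequencies, matching the theoretical frequencies with zero deviation and supporting real-time interaction.

Comments: 9 pages, 5 figures, IEEE conference format

Abstract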

In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers for simulation, offline computing failing real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.
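The last stage of such a pipeline, going from a recognized pattern class to a sine tone, can be sketched with the textbook Kirchhoff-Love result for a simply supported rectangular plate. This is a deliberate simplification: real Chladni plates usually have free edges, and the plate dimensions, material constants, and the class-to-mode table below are hypothetical values, not the paper's dataset or calibration.

```python
import math
import numpy as np

def plate_mode_frequency(m, n, a, b, h, E, nu, rho):
    """Natural frequency (Hz) of mode (m, n) for a simply supported thin plate."""
    D = E * h**3 / (12.0 * (1.0 - nu**2))  # flexural rigidity
    return (math.pi / 2.0) * math.sqrt(D / (rho * h)) * ((m / a) ** 2 + (n / b) ** 2)

def sine_wave(freq_hz, duration_s=0.5, sample_rate=44100):
    """Mono sine tone as a float array in [-1, 1]."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2.0 * np.pi * freq_hz * t)

# Hypothetical 24 cm square aluminium plate, 1 mm thick (SI units).
PLATE = dict(a=0.24, b=0.24, h=0.001, E=69e9, nu=0.33, rho=2700.0)
# Hypothetical lookup from classifier output to a plate mode.
CLASS_TO_MODE = {0: (1, 1), 1: (1, 2), 2: (2, 2)}

def sonify(class_id):
    """Map a predicted pattern class to its mode frequency and a sine tone."""
    m, n = CLASS_TO_MODE[class_id]
    freq = plate_mode_frequency(m, n, **PLATE)
    return freq, sine_wave(freq)

f11, _ = sonify(0)
f22, tone = sonify(2)
```

Because the mapping is a fixed lookup followed by a closed-form formula, the synthesized frequency equals the theoretical one by construction, and the cost at run time is dominated by the classifier.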

2605.09259 2026-05-12 cs.SD cs.AI Version update

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

Leduo Chen, Junchuan Zhao, Shengchen Li

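AI summary: This paper studies flexible timbre transfer in multi-instrument mixtures: converting the timbre of each stem to a target instrument while preserving the original melody and rhythm. The authors propose MixtureTT, described as the first system for per-stem timbre transfer directly from a polyphonic mixture, which processes all stems jointly through a shared diffusion process and so avoids the error accumulation and incoherent timbres of separate-then-transfer pipelines. Experiments show that MixtureTT outperforms single-instrument baselines on objective and subjective metrics, confirming the importance of cross-stem modeling for mixture-level timbre transfer.

Abstract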

Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. Modeling the dependencies across the per-stem content and cross-stem harmonic, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, demonstrating the necessity of dedicated multi-instrument timbre transfer over naive separate-then-transfer pipelines. As a result, this work confirms that cross-stem modeling is essential for mixture-level timbre transfer, as the proposed joint setting consistently exceeds an equivalent single-stem ablation.

2605.02948 2026-05-12 cs.LG cs.AI cs.SD Version update

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

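AI summary: AsymTalker is a diffusion-based method for long-term talking head generation that targets the identity inconsistency and temporal-spatial misalignment seen in existing methods on long videos. It introduces Temporal Reference Encoding (TRE), which mitigates the misalignment between static identity references and dynamic audio streams, and Asymmetric Knowledge Distillation (AKD), which addresses identity drift in chunk-wise generation. Experiments show that AsymTalker produces high-fidelity, identity-consistent videos of up to 600 seconds at a real-time inference speed of 66 FPS, achieving state-of-the-art results.

Abstract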

Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning dilemma in chunk-wise training: using ground-truth references causes train-inference mismatch, while self-generated references entangle supervision with identity drift. Our asymmetric design circumvents this by anchoring the teacher model with ground-truth continuity references to provide drift-free, chunk-level supervision, thereby avoiding the teacher bottleneck. Meanwhile, the student model learns under inference-aligned conditions, conditioned only on self-generated references, and is trained via distribution matching to preserve identity over long horizons. Extensive experiments show AsymTalker achieves state-of-the-art results on HDTF and VFHQ. It guarantees high-fidelity, identity-consistent synthesis over 600-second videos and reaches a real-time inference speed of 66 FPS.

2601.12248 2026-05-12 eess.AS cs.AI cs.CL cs.LG cs.SD Version update

AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

Chun-Yi Kuan, Hung-yi Lee

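AI summary: AQUA-Bench is a new benchmark for evaluating the ability to recognize unanswerable questions in audio question answering, filling a gap left by existing benchmarks, which pay little attention to such questions. It systematically evaluates models in three scenarios: the correct answer is missing, the answer choices are categorically mismatched with the question, and the question is unrelated to the audio content, giving a fuller measure of model reliability and robustness. Experiments show that although existing models do well on answerable tasks, they still face notable challenges on unanswerable ones, revealing a blind spot in current audio-language understanding.

Comments: Accepted to ICASSP 2026 (Oral). Project Website: https://github.com/kuan2jiu99/aqua-bench

Abstract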

Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.

2601.02954 2026-05-12 cs.SD cs.AI Version update

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

Yuhuan You, Lai Wei, Xihong Wu, Tianshu Qu

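AI summary: This paper proposes The World is Not Mono (TWNM), a framework that strengthens the ability of large audio-language models to understand where sound events are located. It uses physically grounded First-Order Ambisonics (FOA) simulation, learns spatial representations from multichannel audio, and fuses them with semantic features to enable multi-level analysis of audio scenes. On the benchmark the authors construct, which covers tasks such as localization and scene reasoning, the method reaches 70.8% overall accuracy.

Comments: 25 pages, 4 figures

Abstract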

Large audio-language models have made rapid progress in recognizing what is present in an audio clip, but spatial audio-language understanding still lacks a clear task interface. A model must also decide where sound events occur, which semantic and spatial attributes belong to the same auditory object, how multiple objects are arranged, and whether a scene-level answer is physically plausible. We formalize this capability as audio scene analysis (ASA), a three-level problem spanning atomic perception, relational integration, and cognitive reasoning. We propose The World is Not Mono (TWNM), a framework that equips audio-language models with explicit spatial evidence. TWNM uses physically grounded First-Order Ambisonics (FOA) simulation for controllable supervision, learns slot-regularized spatial representations from multichannel audio, fuses them with semantic audio features, and trains with a progressive curriculum ending in preference optimization over metadata-derived answers and auxiliary format/evidence rewards. To operationalize ASA, we build a controlled benchmark from scene metadata, covering localization, attribute binding, spatial comparison, scene abduction, and counterfactual reasoning. On this benchmark, TWNM achieves 70.8% overall accuracy, 66.4% on spatial-family tasks, and 79.76% on mixed L3 scene-level multiple-choice QA. We also audit monaural and binaural reference systems as diagnostic references with explicit audit labels, since they differ in spatial input, training interface, and output format. The supported claim is that a clearly defined ASA hierarchy, FOA-conditioned spatial representations, and metadata-grounded training enable controlled, auditable spatial audio-language reasoning, with STARSS23 providing a limited real-recording diagnostic.

2511.17879 2026-05-12 cs.LG cs.SD Version update

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu, Stephen Brade, Aleksandra Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang

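AI summary: This paper studies how generative adversarial post-training can mitigate reward hacking in reinforcement learning post-training for real-time human-AI music collaboration. The authors propose an adversarial training method on policy-generated trajectories to improve the diversity and adaptivity of melody-to-chord accompaniment. Experiments show improved output diversity, harmonic coherence, and user interaction experience.

Comments: v3: fix the Figure numbering bugs

Abstract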

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
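The interplay between the discriminator and the policy reward can be sketched in a few lines. The sketch assumes a discriminator that outputs a logit per trajectory, a binary cross-entropy discriminator loss, and a reward that adds the discriminator's probability to the coherence reward with a weight `adv_weight`; these specifics, and all numbers, are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(logits_data, logits_policy):
    """Binary cross-entropy: dataset trajectories are labeled 1, policy ones 0."""
    p_data = sigmoid(np.asarray(logits_data, dtype=float))
    p_policy = sigmoid(np.asarray(logits_policy, dtype=float))
    return float(-(np.log(p_data).mean() + np.log(1.0 - p_policy).mean()))

def policy_reward(coherence, disc_logit, adv_weight=1.0):
    """Coherence reward plus a bonus for looking like the data distribution."""
    return coherence + adv_weight * float(sigmoid(disc_logit))

# A collapsed output: very coherent, but easy for the discriminator to spot.
collapsed = policy_reward(coherence=0.9, disc_logit=-3.0)
# A varied output: slightly less coherent, but close to the data distribution.
varied = policy_reward(coherence=0.7, disc_logit=2.0)

loss = discriminator_loss(logits_data=[2.0, 1.5], logits_policy=[-3.0, -1.0])
```

Under these made-up numbers the varied output earns the higher total reward, which is the pressure against collapse that the adversarial term is meant to provide.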

2510.19414 2026-05-12 eess.AS cs.AI cs.SD Version update

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Tong Zhang, Yihuan Huang, Yanzhen Ren

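AI summary As speech deepfake technology becomes widespread, security problems in real-world scenarios such as telephone fraud and identity theft are growing more serious. Existing anti-spoofing systems perform well on lab-generated synthetic speech, but their performance drops markedly under physical replay attacks. To address this, the paper presents the EchoFake dataset, which contains more than 120 hours of speech from over 13,000 speakers and covers both state-of-the-art zero-shot text-to-speech (TTS) speech and physical replay recordings made with varied devices in real-world environments, improving the generalization and practical performance of speech deepfake detection models.

Comments ICASSP 2026

Details
English abstract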

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
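
The equal error rate (EER) cited in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `equal_error_rate` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide (genuine) speech.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finite score sets the two error rates rarely cross exactly, so the sketch reports the mean of FAR and FRR at the closest threshold; production toolkits usually interpolate along the ROC curve instead.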

2509.20799 2026-05-12 cs.HC cs.SD Updated version

AuthGlass: Benchmarking Voice Liveness Detection and Authentication on Smart Glasses via Comprehensive Acoustic Features

Weiye Xu, Zhang Jiang, Siqi Zheng, Xiyuxing Zhang, Changhao Zhang, Jian Liu, Weiqiang Wang, Yuntao Wang

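AI summary With the rapid development of smart glasses, voice interaction has been widely adopted for its naturalness and convenience, but its practical use is often threatened by spoofing attacks, and no public dataset currently exists for voice liveness detection and authentication in smart-glasses scenarios. To address this, the researchers collect a multi-acoustic-modal dataset containing 16-channel audio from 42 subjects along with samples from two attack categories, and propose AuthG-Live, a sound-field-based liveness detection method, and AuthG-Net, a multi-modal authentication model. Experiments show that the approach achieves state-of-the-art performance on four benchmark tasks, and ablation studies validate its generalizability in real-world settings. The dataset is released as AuthGlass to support further work in the field.

Comments Submitted to IMWUT 2026

Details
English abstract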

With the rapid advancement of smart glasses, voice interaction has been widely adopted due to its naturalness and convenience. However, its practical deployment is often undermined by vulnerability to spoofing attacks, while no public dataset currently exists for voice liveness detection and authentication in smart-glasses scenarios. To address this challenge, we first collect a multi-acoustic-modal dataset comprising 16-channel audio data from 42 subjects, along with corresponding attack samples covering two attack categories. Based on insights derived from this collected data, we propose AuthG-Live, a sound-field-based voice liveness detection method, and AuthG-Net, a multi-acoustic-modal authentication model. We further benchmark seven voice liveness detection methods and four authentication methods across diverse acoustic modalities. The results demonstrate that our proposed approach achieves state-of-the-art performance on four benchmark tasks, and extensive ablation studies validate the generalizability of our methods under real-world constraints. Finally, we release this dataset, termed AuthGlass, to facilitate future research on voice liveness detection and authentication for smart glasses.

2405.09570 2026-05-12 eess.SP cs.LG cs.SD eess.AS Updated version

FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time

Md Jobayer, Md. Mehedi Hasan Shawon, Md Zakir Hossain, Shreya Ghosh, Imre Rudas, Tom Gedeon, Md Rakibul Hasan

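AI summary This paper proposes FunnelNet, an end-to-end deep learning framework for real-time monitoring of digital heart murmurs. The method combines traditional filtering with depthwise separable convolutional networks: a Butterworth filter and the Continuous Wavelet Transform extract features from heart sounds, and three network modules (squeeze, bottleneck, and expansion) enable efficient feature learning. Experiments show that the model reaches 85% accuracy and 92% specificity on a pediatric heart sound dataset with only 5.4k parameters, and achieves strong real-time detection performance on resource-constrained devices, offering a practical option for accessible diagnosis in regions with limited medical resources.

Details
English abstract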

Heart murmurs are abnormal sounds caused by turbulent blood flow in the heart. Several diagnostic methods are available to detect heart murmurs and their severity, including cardiac auscultation, echocardiography, and phonocardiography (PCG). However, these methods have limitations, including the need for extensive training among healthcare providers, the cost and accessibility of echocardiography, and noise interference during PCG data processing. This study proposes an end-to-end real-time heart murmur detection approach using traditional and depthwise separable convolutional networks. We applied a Butterworth filter and Continuous Wavelet Transform (CWT) to eliminate noise and extract meaningful features from the PCG data. The proposed network consists of three parts: a Squeeze net that generates a compressed data representation, a Bottleneck layer that minimizes computational complexity using depthwise-separable convolutions, and an Expansion net that up-samples the data to capture fine details. We evaluated our model on the publicly available CirCor pediatric heart sound dataset. Using only ~5.4k parameters, we achieved an accuracy of 85%, a sensitivity of 85%, and a specificity of 92%, successfully outperforming several larger models. Furthermore, we converted our network into a TinyML format and tested it on two resource-constrained devices, achieving an average real-time inference accuracy of 91% on a Raspberry Pi 4B and 80% on an Android smartphone. The proposed lightweight model offers a robust deep learning framework for accurate, real-time heart murmur detection, showing strong promise for accessible medical diagnostics in limited-resource environments. The code is publicly available at https://github.com/jobayer/FunnelNet.

2605.09120 2026-05-12 cs.IR cs.SD Updated version

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

Haven Kim, Julian McAuley

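AI summary Conversational music recommendation (CMR) research currently faces a dilemma: authentic dialogue corpora are limited in scale, while synthetic corpora scale up but lack naturalness. This paper presents Reddit2Deezer, a reality-grounded CMR dataset built from 190k unique {thread, leaf-comment} pairs, released in a raw version and a paraphrased version, with each musical entity linked to a Deezer identifier for easy access to audio previews and rich metadata. The dataset has been validated by humans for dialogue quality, item grounding, and paraphrase accuracy, and provides an important resource for research on content-grounded conversational recommendation.

Details
English abstract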

Conversational music recommendation (CMR) research currently faces a tradeoff between authentic dialogue corpora that are limited in scale and synthesized corpora that scale up but whose conversations are artificially constructed rather than naturally observed. In this paper, we introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. A human validation confirms the quality of the dialogues, item grounding, and paraphrases. The dataset is available at https://huggingface.co/datasets/McAuley-Lab/Reddit2Deezer.

2605.09087 2026-05-12 cs.SD cs.LG Updated version

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

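AI summary This paper addresses gender bias in audio deepfake detection systems and proposes a systematic framework for diagnosing and mitigating it. The study finds that gender bias stems mainly from acoustic representation differences, gender leakage in features, and asymmetry in the evaluation structure, rather than from imbalanced training data. By introducing a new fairness regularization method and a threshold adjustment strategy, the work reduces unfairness without harming detection accuracy, offering guidance for building trustworthy audio deepfake detection systems.

Comments Submitted to SMC 2026 conference

Details
English abstract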

Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies the bias source before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.

2605.08762 2026-05-12 cs.SD cs.LG Updated version

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Tao Yu, yiming ding, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Xinming Wang, Xinlong Chen, Zhaolu Kang, Junhao Gong, Yuxuan Zhou, Haopeng Jin, Zhiqing Cui, Jiabing Yang, YiFan Zhang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang

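AI summary Current omni-modal benchmarks mainly evaluate models in settings where multiple modalities are provided simultaneously, while the ability to start from audio and actively search for cross-modal evidence remains little studied. This paper presents Omni-DeepSearch, an audio-driven omni-modal deep search benchmark that requires models to extract clues from given audio clips and a related question, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. The benchmark contains 640 samples covering four retrieval target modalities and four audio content types, with a multi-stage filtering pipeline that ensures the task is sufficiently difficult. Experiments show that the strongest current model reaches only 43.44% average accuracy on this task, underscoring the research value of this direction.

Comments 43 pages

Details
English abstract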

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.

2605.08729 2026-05-12 cs.CV cs.GR cs.MM cs.SD Updated version

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu

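AI summary Unison is a unified framework that addresses the alignment problem caused by the asynchronous characteristics of motion, speech, and sound in human-centric video generation. It uses a semantic-guided harmonization strategy that separates the generation of speech and sound-effect components, and applies bidirectional audio cross-attention and semantic-conditioned gating to improve acoustic clarity and reduce speech dominance. Unison also proposes a bidirectional cross-modal forcing strategy that synchronizes motion and audio through decoupled denoising schedules, markedly improving audio perceptual quality and cross-modal synchronization in generated videos.

Details
English abstract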

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

2605.08554 2026-05-12 cs.SD Updated version

Online Segmented Beamforming via Dynamic Programming

Manan Mittal, Ryan M. Corey, Diego Cuji, John R. Buck, Andrew C. Singer

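AI summary In dynamic acoustic environments, where interferers and sources vary over time, traditional beamforming methods struggle to identify stationary regions accurately. This paper proposes an online segmented beamforming algorithm based on dynamic programming that uses data-driven temporal segmentation to adapt the covariance matrix estimation window to local stationarity, and resets the covariance estimate in real time when the environment changes abruptly, so that newly active interferers are tracked effectively. Experiments show that the method outperforms fixed-window adaptive methods in complex reverberant environments.

Comments 4 pages, 2 figures

Details
English abstract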

In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccessible in practice. Practical implementations overcome this by estimating the sample covariance matrix (SCM) through averaging over a block of temporal samples. However, in non-stationary settings, a naive batch approach fails. Moving interferers smear the SCM, causing the beamformer to place nulls in outdated locations while failing to track newly active interferers, thereby degrading its nulling capabilities. To address this fundamental limitation, an Online Segmented Beamformer is proposed. This algorithm incorporates data-driven temporal segmentation to causally minimize output power while dynamically adapting the SCM estimation windows to local stationarity. By framing the problem through the lens of dynamic programming, the proposed method tracks abrupt environmental changes and resets covariance estimates in real-time. We validate the performance of this framework in a complex, reverberant simulated acoustic environment and in highly reverberant real world experiments, demonstrating its superiority over fixed-window adaptive methods.
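
For readers unfamiliar with the Capon (minimum-variance distortionless response) beamformer and the sample covariance matrix (SCM) mentioned in the abstract, their textbook forms are sketched below. The notation is assumed here for illustration and is not taken from the paper: $\mathbf{x}[n]$ is the array snapshot at time $n$, $N$ is the averaging block length, and $\mathbf{a}(\theta)$ is the steering vector toward the look direction $\theta$.

```latex
% Sample covariance matrix averaged over a block of N snapshots
\hat{\mathbf{R}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}[n]\, \mathbf{x}[n]^{\mathsf{H}}

% Capon weights: minimize output power subject to unit gain toward \theta
\mathbf{w} = \frac{\hat{\mathbf{R}}^{-1} \mathbf{a}(\theta)}
                  {\mathbf{a}(\theta)^{\mathsf{H}} \hat{\mathbf{R}}^{-1} \mathbf{a}(\theta)}

% Resulting output power
P(\theta) = \frac{1}{\mathbf{a}(\theta)^{\mathsf{H}} \hat{\mathbf{R}}^{-1} \mathbf{a}(\theta)}
```

The choice of $N$ is the tension the abstract describes: a long block smears the estimate when interferers move, while a short block gives a noisy $\hat{\mathbf{R}}$.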

2605.05611 2026-05-12 cs.SD cs.AI eess.AS Updated version

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen

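AI summary This paper presents X-Voice, a 0.4B-parameter multilingual zero-shot voice cloning model that lets users clone any voice and speak in 30 languages. The model is trained on a 420,000-hour multilingual corpus, uses the International Phonetic Alphabet (IPA) as a unified representation, and adopts a two-stage training framework that enables zero-shot cloning without complex preprocessing. By extending the F5-TTS architecture with dual-level injection of language identifiers and a decoupled scheduling mechanism for Classifier-Free Guidance, X-Voice outperforms existing systems in both subjective and objective evaluations and achieves cross-lingual cloning capabilities comparable to billion-scale models.

Comments 16 pages, 4 figures, 9 tables

Details
English abstract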

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.

2603.09007 2026-05-12 cs.SD cs.AI Updated version

Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis

Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

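AI summary This paper studies gender fairness in audio deepfake detection, analyzing how the performance of existing detection models differs across genders. Using the ASVspoof 5 dataset, the authors train a ResNet-18 classifier, evaluate it with four audio features, and compare it with the baseline AASIST model. With five fairness metrics, the study finds that even when the overall EER difference is small, the distribution of errors across genders still differs markedly, which highlights the limitations of conventional performance metrics and the importance of fairness-aware evaluation for building more equitable and reliable audio deepfake detection systems.

Comments Paper Accepted to IEEE CAI Conference 2026

Details
English abstract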

Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity theft and impersonation increases. Although significant progress has been made in the field of Audio Deepfake Detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we attempt a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. We use the ASVspoof 5 dataset to train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the performance with the baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.

2605.08224 2026-05-12 cs.IT cs.SD math.HO math.IT Updated version

Uniqueness on a Continuum: Quantifying Tonal Ambiguity Using Information Theory

Michael Seltenreich

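AI summary This paper proposes a continuous, information-theoretic measure for quantifying tonal ambiguity, extending the traditional concept of "uniqueness". The measure addresses three shortcomings of the original concept: it cannot discriminate among sets that possess uniqueness, cannot capture hierarchical structure in modes of limited transposition, and cannot account for temporal unfolding. The measure applies across pitch-class sets and tuning systems, broadening the analytic coverage of tonal relationships and providing a practical tool for music theory and analysis.

Comments 14 pages, 6 figures, 9 tables

Details
English abstract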

We propose a continuous measure of tonal ambiguity that extends the established concept of uniqueness. While uniqueness is widely regarded as necessary for tonality, it cannot (i) discriminate among sets that possess it, (ii) capture hierarchical organization in modes of limited transposition, or (iii) account for temporal unfolding. To address these limitations, we introduce a companion measure, grounded in information theory, that quantifies tonal ambiguity on a continuous scale. The measure applies across pitch-class sets and tuning systems, expanding analytic coverage of tonal relationships and offering a practical tool for theory and analysis.

2605.08214 2026-05-12 cs.SD cs.AI eess.AS Updated version

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Mohammed Aman Bhuiyan, Md Sazzad Hossain Adib, Samiul Basir Bhuiyan, Amit Chakraborty, Aritra Islam Saswato, Ahmed Faizul Haque Dhrubo, Mohammad Ashrafuzzaman Khan

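AI summary This paper addresses the challenges of long-form speech recognition and speaker diarization in Bangla and proposes improved methods based on Whisper and PyAnnote. By fine-tuning the Whisper model and the PyAnnote segmentation module, together with data augmentation and training on a custom dataset, the work substantially improves Bangla long-form speech recognition and speaker diarization. The resulting systems achieve a word error rate (WER) of 0.2441 and a diarization error rate (DER) of 0.2392 on the test set, outperforming the original pretrained models.

Comments 3 figures and 5 tables

Details
English abstract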

Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long-form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long-form ASR and speaker diarization. For ASR (Problem 1), we fine-tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full-weight training with extensive data augmentation including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition annotated diarization dataset, swapping the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441, while our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.
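
The Word Error Rate reported above is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the competition's scoring script, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: both strings are already normalized and are tokenized
    on whitespace. The result is a fraction, so 0.2441 means 24.41%.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.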

2605.08194 2026-05-12 cs.SD eess.AS eess.SP Updated version

ShipEcho -- An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels

Mark Shipton, Valentino Denona, Đula Nađ, Roee Diamant

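AI summary This paper introduces ShipEcho, an interactive web-based Geographic Information System (GIS) for mapping underwater radiated noise from vessels (V-URN) worldwide in near real time. The tool uses community-based Automatic Identification System (AIS) data, combined with established vessel acoustic models and bathymetric data for propagation modeling, to produce noise maps including sound pressure levels and sound exposure levels in different frequency bands. The study demonstrates ShipEcho's potential to support environmental assessment, decision-making, and policy development, and validates map accuracy through comparison with real acoustic recordings.

Comments 34 pages

Details
English abstract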

Underwater radiated noise from vessels (V-URN) is a recognized environmental stressor that negatively impacts marine ecosystems. Significant resources are invested in the development of V-URN monitoring indicators, regulatory frameworks, and management-oriented assessments. One approach with high potential for impact is V-URN mapping, which can provide actionable spatiotemporal information for environmental assessment and mitigation planning. Producing management-scale maps remains challenging as passive acoustic measurements are spatially sparse and many operational systems depend on specialist workflows and costly access to wide-area vessel activity data. To address these constraints, we introduce ShipEcho, a freely accessible web-based Geographic Information System (GIS) that provides near-real-time V-URN mapping using vessel data acquired through a community-based AIS exchange. Using established vessel source level (SL) models and propagation modeling informed by bathymetric data, ShipEcho produces near-real-time and cumulative noise maps across regions worldwide. These include sound pressure levels and sound exposure levels using standard indicators, including the 63 Hz and 125 Hz one-third octave bands and a 20-2000 Hz broadband level. We describe the system architecture, data pipeline, modeling workflow, and key assumptions, and evaluate map accuracy through comparison with acoustic recordings. We then demonstrate how ShipEcho can support management-level assessment, decision-making, and policy initiatives through practical use cases.
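
To make the indicators named in the abstract concrete, the sketch below shows two generic acoustics relations: the edges of a one-third octave band around a center frequency, and the sound exposure level (SEL) that results from holding a constant sound pressure level (SPL) for a given duration. This is an illustration under stated assumptions, not ShipEcho's implementation: the function names are hypothetical, the band edges use the base-2 convention applied to the nominal center frequency, and the SEL relation only holds for a level that is constant over the interval.

```python
import math


def third_octave_band_edges(center_hz):
    """Return (lower, upper) edge in Hz of a base-2 one-third octave band.

    Assumption: edges sit one sixth of an octave below and above the
    given center; standards also define a base-10 variant whose values
    differ slightly.
    """
    if center_hz <= 0:
        raise ValueError("center_hz must be positive")
    half_width = 2 ** (1 / 6)
    return center_hz / half_width, center_hz * half_width


def sel_from_constant_spl(spl_db, duration_s, reference_s=1.0):
    """SEL in dB for an SPL that stays constant over duration_s seconds.

    Assumption: the SPL does not vary in time, so
    SEL = SPL + 10 * log10(duration / 1 s).
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return spl_db + 10 * math.log10(duration_s / reference_s)


if __name__ == "__main__":
    for center in (63.0, 125.0):
        low, high = third_octave_band_edges(center)
        print(f"{center:g} Hz band: {low:.1f} Hz to {high:.1f} Hz")
    print(sel_from_constant_spl(120.0, 10.0))
```

Cumulative maps of the kind described in the abstract would integrate time-varying levels instead of assuming a constant SPL, so the second function is only the simplest special case.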