arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20014 2026-05-20 cs.SD updated version

Precise and Simple Audio-to-Score Alignment

Silvan Peter, Patricia Hu, Gerhard Widmer

Affiliations: Institute of Computational Perception, Johannes Kepler University; LIT AI Lab, Linz Institute of Technology

AI summary: This paper presents an algorithm that bridges audio-like and symbol-level features directly. Derived from symbolic alignment methods, it achieves precise and flexible audio-to-score alignment that adapts to different timbral characteristics.

Comments: Published at the Music Encoding Conference (MEC) 2026

Abstract

Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work the versions need to be in comparable formats. Audio-to-audio alignment matches audio features; to match audio files to scores, such methods must either synthesize the score or derive audio-like features by means of piano rolls or similar feature sequences. Symbolic alignment, by contrast, matches symbolically encoded notes; in an audio-to-score scenario these would be obtained by transcribing the audio file. In this article, we present an algorithm that bridges audio-like and symbol-level features directly. Sequential audio features encoding onset and spectral activation are matched to score positions by a bespoke dynamic-programming-based matching algorithm derived from symbolic alignment methods. The resulting method is precise, surpassing widely used audio-to-audio approaches based on synthesized scores, and remains flexible in its digital signal processing components, i.e., it is adaptable to diverse timbral characteristics without requiring a separate transcription model. Furthermore, it inherits some of the runtime advantages of symbolic alignment, with an algorithmic complexity that is at worst linear in the length of the (typically short) symbolic score and the (typically long) audio feature sequence. In the following sections, we provide a detailed algorithm description and evaluate its alignment quality on a large-scale dataset of solo piano recordings.
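
To make the idea of matching frame-wise audio features to score positions concrete, here is a minimal sketch of a generic monotonic dynamic-programming alignment. It is not the authors' algorithm, which the abstract does not specify: the function name, the reward (mean activation of the pitches expected at a score position), and the rule that each frame either stays at the current position or advances by one are all illustrative assumptions. The sketch fills a table whose size is the product of the two sequence lengths and makes no claim to match the paper's complexity.

```python
def monotonic_alignment(activation, score_pitches):
    """Assign every audio frame to a score position, never moving backwards.

    activation:    list of frames, each a list of per-pitch activations (hypothetical input).
    score_pitches: list of non-empty pitch-index collections, one per score position.
    """
    n_frames, n_pos = len(activation), len(score_pitches)
    if n_frames < n_pos:
        raise ValueError("need at least one frame per score position")
    neg_inf = float("-inf")

    def reward(t, j):
        # Mean activation of the pitches the score expects at position j.
        pitches = score_pitches[j]
        return sum(activation[t][p] for p in pitches) / len(pitches)

    best = [[neg_inf] * n_pos for _ in range(n_frames)]
    advanced = [[False] * n_pos for _ in range(n_frames)]
    best[0][0] = reward(0, 0)  # the path must start at the first score position
    for t in range(1, n_frames):
        for j in range(n_pos):
            stay = best[t - 1][j]
            advance = best[t - 1][j - 1] if j > 0 else neg_inf
            advanced[t][j] = advance > stay
            best[t][j] = max(stay, advance) + reward(t, j)

    # Backtrack from the last frame, which must sit at the last score position.
    path = [0] * n_frames
    j = n_pos - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = j
        if advanced[t][j]:
            j -= 1
    return path


# Toy input: three score positions, each one pitch, each sounding for two frames.
frames = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
print(monotonic_alignment(frames, [[0], [1], [2]]))  # prints [0, 0, 1, 1, 2, 2]
```

The returned list gives, for each frame, the index of the score position it was matched to; the frame at which the index first changes is the estimated onset of that position.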

2605.19984 2026-05-20 cs.SD updated version

A conceptual framework for learning to listen by reward: Curiosity-driven search for novel sources

Andreas Triantafyllopoulos, Jakub Šťastný, Alexios Terpinas, Tianyi Liu, Yuanqi Wang, Björn W. Schuller

Affiliations: CHI – Chair of Health Informatics, Technical University of Munich; MCML – Munich Center for Machine Learning; MDSI – Munich Data Science Institute; GLAM – Group on Language, Audio, & Music, Imperial College

AI summary: This paper proposes a conceptual framework for learning to listen by reward, using a curiosity-driven search for novel sound sources to address the limited use of reinforcement learning in the audio domain.

Abstract

Reinforcement learning is a powerful learning paradigm that has spearheaded progress in numerous domains. Its core promise lies in learning through high-level goals without the need for granular labels. However, it still remains elusive in the realm of audio, where it has received substantially less attention than in computer vision or other domains. The key question remains: how can agents learn to listen purely via reward-driven exploration? In this contribution, we present an overview of previous attempts and a new conceptual framework for learning to listen by reward. Our approach depends on the continuous search for novel sound sources. We formulate our framework, discuss open technical challenges, and present a first proof-of-concept implementation that showcases the feasibility of our approach.

2605.19833 2026-05-20 cs.SD cs.AI cs.CL cs.MM eess.AS updated version

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

Affiliations: NTU; NUS; Shanghai AI Lab

AI summary: This paper proposes Mega-ASR, a unified in-the-wild speech recognition framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. Trained on 7 classic acoustic phenomena and 54 physically plausible compound scenarios, it substantially improves speech recognition in adverse environments.

Comments: Project page: https://xzf-thu.github.io/Mega-ASR/. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail.

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
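
The results above are reported as word error rate (WER). As a reminder of how that metric is usually computed, the sketch below uses the standard word-level edit distance divided by the reference length. The function name and the whitespace tokenization are illustrative assumptions; benchmark toolkits typically apply additional text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One inserted word against a three-word reference gives a WER of 1/3.
print(word_error_rate("the cat sat", "the cat sat down"))
```

Because insertions count as errors, WER can exceed 100% when a system hallucinates more words than the reference contains, which is one of the failure modes the abstract describes.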

2511.21577 2026-05-20 cs.SD cs.AI updated version

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

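Kexin Li, Xiao Hu, Ilya Grishchenko, David Lie

Affiliations: University of Toronto

AI summary: This paper proposes HarmonicAttack, a new audio watermark removal method that needs no access to the target watermark detector. It trains a general model to remove audio watermarks while maintaining high perceptual quality on datasets from different distributions.

Comments: Under review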

Abstract

The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is to watermark it, so that it can be easily distinguished from genuine audio. Those seeking to misuse AI-generated audio may attempt to remove audio watermarks, so studying effective watermark removal techniques is critical to objectively evaluate the robustness of audio watermarks. Previous watermark removal schemes typically assume access to the target watermark detector during the removal process. This assumption is often impractical, which may lead to a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, a novel audio watermark removal method that requires no access to the target watermark algorithm. It only needs a number of original and watermarked samples to train a general model capable of removing watermarks from audio samples. We also find that training samples do not need to share the same distribution as target samples, as our attack generalizes to out-of-distribution samples with minimal degradation. Compared with existing watermark removal attacks, HarmonicAttack is more effective at removing watermarks from state-of-the-art schemes, including AudioSeal, WavMark, SilentCipher, and AudioMarkNet, while maintaining high perceptual quality. Although HarmonicAttack is trained on the LibriSpeech dataset against AudioSeal, it generalizes across unseen datasets and watermarking schemes. For instance, on VCTK, HarmonicAttack achieves a 92% attack success rate (ASR) against AudioMarkNet, substantially outperforming the best baseline at 38%. On FMA, HarmonicAttack reaches 100% ASR against all watermarks, whereas the best baseline achieves only 2% against AudioSeal and 44% against WavMark.

2605.19695 2026-05-20 eess.AS cs.SD updated version

Cross-Talk Speech Reduction, by Separation, for Separation

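Zhong-Qiu Wang, Samuele Cornell

Affiliations: Department of Computer Science and Engineering, Southern University of Science and Technology; Language Technologies Institute, Carnegie Mellon University

AI summary: This paper proposes cross-talk reduction (CTR), a task that aims to isolate the wearer's speech from each close-talk mixture, and a new method, CTRnet, that can be trained directly on real-recorded pairs of close-talk and far-field mixtures. Building on CTRnet, it further proposes pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models that separate far-field mixtures. Both CTRnet and PuLSS can be trained on real data from the target domain, which addresses the generalization gap commonly seen when models are trained only on simulated data. On the CHiME-6 dataset, the framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. According to the authors, it is the first neural speech separation method to substantially outperform guided source separation on real conversational "speech-in-the-wild" data.

Comments: In submission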

Abstract

In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.

2605.19632 2026-05-20 cs.LO cs.SD updated version

Executable Boundary Contracts for Sound Event Traces

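Faruk Alpay, Hamdi Alakkad

Affiliations: Bahcesehir University

AI summary: This paper proposes executable boundary contracts for measuring finite sound event traces. It defines a frame fragment, an event layer, and associated clauses to evaluate timed boundary behavior, with the goal of making sound event reports more accurate.

Comments: 39 pages. Finite frame core code, tables, manifests, and Lean checks are ancillary material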

Abstract

Sound event reports often compress timed boundary behavior into frame, segment, or event scores. This paper defines executable boundary contracts for finite sound event traces. The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection. The event layer adds declared interval matching, duration clauses, fragmentation clauses, and obligation-restricted vector scoring. The aim is measurement, not a new general temporal logic and not a challenge leaderboard. The artifact evaluates controlled Mini LibriSpeech seeded scenes, MAESTRO Real soundscapes, frozen pretrained timing probes, and an official DCASE 2024 Task 4 baseline track. Across these tracks, standard scores and contract coordinates disagree in interpretable ways. The strongest real-corpus finding is that union activity can hide typed boundary failure, while external DCASE outputs provide a class-indexed, challenge-level reference. Code, generated tables, manifests, and Lean checks for the finite frame core are supplied as ancillary material.
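
As a rough illustration of what checking a duration clause and a fragmentation clause on a finite frame trace could look like, consider the sketch below. The clause semantics (a minimum run length and a maximum number of active runs) and the names `active_runs` and `satisfies_contract` are hypothetical; the abstract does not define the paper's contract language, so this should be read only as one plausible interpretation.

```python
def active_runs(frames):
    """Return (start, end) frame index pairs, end exclusive, for each active run."""
    runs, start = [], None
    for index, active in enumerate(frames):
        if active and start is None:
            start = index
        elif not active and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(frames)))
    return runs


def satisfies_contract(frames, min_duration, max_fragments):
    """Hypothetical contract: every run is long enough and there are not too many runs."""
    runs = active_runs(frames)
    long_enough = all(end - start >= min_duration for start, end in runs)
    return long_enough and len(runs) <= max_fragments


trace = [0, 1, 1, 1, 0, 0, 1, 0]  # one event detected as two fragments
print(active_runs(trace))               # prints [(1, 4), (6, 7)]
print(satisfies_contract(trace, 2, 2))  # prints False: the second run lasts one frame
```

A frame-level score would credit five of the eight frames in this trace regardless of how the active frames are grouped, whereas the clause-based check exposes the short second fragment.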

2605.19541 2026-05-20 cs.SD updated version

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

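Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

Affiliations: Tsinghua University; Huawei Technologies Co., Ltd

AI summary: This paper proposes ClariCodec, a neural speech codec operating at 300 bps. It treats quantisation as a stochastic policy and uses reinforcement learning to optimise intelligibility, which reduces word error rate at extreme compression levels.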

Abstract

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
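
The quoted relative reduction can be checked against the test-clean figures with a one-line calculation. The helper name is an illustrative choice; only the two WER values come from the abstract, and the pre-RL figure for test-other is not given, so only test-clean is verified here.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped relative to its starting value."""
    return (before - after) / before


# test-clean WER before and after RL fine-tuning, as reported in the abstract
print(f"{relative_reduction(4.64, 3.55):.1%}")  # prints 23.5%
```

The result is consistent with the stated reduction of about 23%.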

2605.19101 2026-05-20 cs.SD cs.LG updated version

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li

Affiliations: Shenzhen International Graduate School, Tsinghua University; Independent Researcher; The Hong Kong Polytechnic University

AI summary: This paper proposes GST, a heterogeneity-aware dataset scheduling method that groups datasets and introduces them through a progressive schedule, balancing the stability of parallel training with the efficiency of sequential optimization. It converges 30-40% faster on 14 AudioQA datasets.

Abstract

Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixing. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30-40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.
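
The abstract does not say which gradient-based affinity metric GST uses. One common way to quantify whether two datasets pull a model in compatible directions is the cosine similarity between their gradients, sketched below under that assumption. The function name and the toy gradient vectors are hypothetical; in practice each vector would be a flattened gradient computed on a batch from one dataset.

```python
import math


def cosine_affinity(gradients):
    """Pairwise cosine similarity between per-dataset gradient vectors.

    Values near +1 suggest aligned updates, values near -1 conflicting ones.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    norms = [math.sqrt(dot(g, g)) for g in gradients]
    n = len(gradients)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denominator = norms[i] * norms[j]
            if denominator > 0:
                affinity[i][j] = dot(gradients[i], gradients[j]) / denominator
    return affinity


# Toy gradients: datasets 0 and 1 agree, 2 conflicts with both, 3 is orthogonal.
toy = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 3.0]]
for row in cosine_affinity(toy):
    print(row)
```

Under this reading, datasets with high mutual affinity would be natural candidates for the same group, while strongly negative pairs would be scheduled apart.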

2605.16681 2026-05-20 eess.AS cs.SD eess.SP updated version

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

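Ningyuan Yang, Yize Li, Diego A. Cuji, Ryan M. Corey, Pu Zhao, Xue Lin, Andrew C. Singer

Affiliations: Discovery Partners Institute

AI summary: This paper surveys audio super-resolution and bandwidth extension, tracing the shift from discriminative to generative models. It summarizes the limitations of early discriminative models and the improvements generative models bring in representation domain, architecture, and conditioning mechanisms, discusses emerging directions such as large language models and multimodal foundation models, and identifies open challenges in perceptual evaluation, phase modeling, and real-world generalization.

Comments: Under review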

Abstract

Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing. We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency. Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.

2605.02223 2026-05-20 cs.SD cs.CV updated version

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

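Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

Affiliations: Posts and Telecommunications Institute of Technology

AI summary: This paper presents the MIST dataset, the ISA method, and the SF1@tau metric for detecting multi-region speech inpainting, and shows that existing deepfake detectors fall short on fine-grained speech inpainting detection.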

Abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, in which an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
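
A segment-level F1 based on temporal IoU matching can be sketched as follows. The abstract does not give the exact matching procedure behind SF1@tau, so the greedy one-to-one matching by descending IoU, the function names, and the default threshold here are assumptions made for illustration only.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def segment_f1(predicted, reference, tau=0.5):
    """F1 over segments; a prediction counts if it matches an unused reference with IoU >= tau."""
    if not predicted and not reference:
        return 1.0
    candidates = sorted(
        ((temporal_iou(p, r), i, j)
         for i, p in enumerate(predicted)
         for j, r in enumerate(reference)),
        reverse=True,
    )
    used_predicted, used_reference = set(), set()
    for iou, i, j in candidates:
        if iou < tau:
            break  # remaining candidates overlap even less
        if i in used_predicted or j in used_reference:
            continue
        used_predicted.add(i)
        used_reference.add(j)
    true_positives = len(used_predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [(1.0, 1.5), (3.0, 3.4)]   # two tampered regions, in seconds
predicted = [(1.0, 1.4), (5.0, 5.5)]   # one good hit, one false alarm
print(segment_f1(predicted, reference))  # one match out of two on each side
```

Because unmatched predictions lower precision and unmatched references lower recall, a metric of this shape penalizes both a wrong region count and poor localization, which is the joint behavior the abstract attributes to SF1@tau.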

2602.11910 2026-05-20 cs.SD cs.LG updated version

TADA! Tuning Audio Diffusion Models through Activation Steering

Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja

Affiliations: Warsaw University of Technology; IDEAS Research Institute

AI summary: Using activation patching, this paper reveals a semantic bottleneck in audio diffusion models and shows that localized activation steering achieves new state-of-the-art performance in audio concept modulation.

Comments: Preprint

Abstract

Audio diffusion models can synthesize high-fidelity music from text, yet achieving fine-grained control over specific musical attributes remains challenging, as their internal mechanisms for representing high-level concepts are poorly understood. In this work, we use activation patching to demonstrate that recent audio diffusion architectures exhibit a semantic bottleneck, where a small, shared subset of consecutive attention layers controls distinct musical concepts, such as the presence of specific instruments, vocals, or genres. Building on this, we systematically evaluate a broad spectrum of steering paradigms, comparing activation steering against prompt-level, score-space, and weight-space interventions, analyzing the interaction between the steering mechanism and the intervention site. Our new benchmark, supported by an extensive user study, demonstrates that localized activation steering establishes a new state-of-the-art in audio concept modulation.
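
In its simplest form, activation steering adds a scaled direction vector to a layer's activations at inference time. The sketch below builds that direction as a difference of mean activations between examples with and without a concept, which is a common choice in the steering literature but is not confirmed by the abstract as the authors' method. All function names and the toy numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(with_concept, without_concept):
    """Direction from 'concept absent' to 'concept present' in activation space."""
    present = mean_vector(with_concept)
    absent = mean_vector(without_concept)
    return [p - a for p, a in zip(present, absent)]


def steer(activation, direction, alpha):
    """Shift one activation vector along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy activations from one layer: the concept mostly moves the first dimension.
with_piano = [[1.0, 2.0], [3.0, 2.0]]
without_piano = [[0.0, 2.0], [2.0, 2.0]]
direction = steering_vector(with_piano, without_piano)
print(direction)                          # prints [1.0, 0.0]
print(steer([0.5, 0.5], direction, 2.0))  # prints [2.5, 0.5]
```

The localization finding in the abstract would correspond to applying such a shift only at the small set of attention layers identified by activation patching, rather than throughout the network.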

2506.14148 2026-05-20 cs.SD cs.CL eess.AS updated version

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

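Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

AI summary: This paper presents a new approach to non-invasive object classification using acoustic scattering, demonstrated through a case study on hair assessment. By emitting acoustic stimuli and capturing the signals scattered from head-with-hair-sample objects, it classifies hair type and moisture with AI-driven, deep-learning-based sound classification. The authors benchmark (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. The best strategy reaches nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification.

Comments: This paper has been retracted by the authors. Due to miscommunication, the authorship is incomplete and missing early contributions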

Abstract

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.

2105.00933 2026-05-20 cs.SD cs.AI cs.LG eess.AS updated version

Deep Neural Network for Musical Instrument Recognition using MFCCs

Saranga Kingkor Mahanta, Abdullah Faiz Ur Rahman Khilji, Partha Pakray

Affiliations: Department of Electronics and Communication Engineering, National Institute of Technology, Silchar, Assam, India

AI summary: This paper proposes an MFCC-based deep neural network model that classifies musical instruments into twenty classes, trained on the London Philharmonic Orchestra dataset, and reports high recognition accuracy.

Journal ref: Computacion y Sistemas, Vol 25, No 2 (2021): 25(2) 2021

Abstract

The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of identifying an instrument from its audio. This audio, i.e., the sound vibrations, is leveraged by the model to match the instrument classes. In this paper, we use an artificial neural network (ANN) model that was trained to perform classification on twenty different classes of musical instruments. Here we use only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model trains on the full London Philharmonic Orchestra dataset, which contains twenty classes of instruments belonging to four families, viz. woodwinds, brass, percussion, and strings. Based on experimental results, our model achieves state-of-the-art accuracy on this dataset.

1912.11333 2026-05-20 cs.SD cs.LG eess.AS updated version

Audio-based automatic mating success prediction of giant pandas

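WeiRan Yan, MaoLin Tang, Qijun Zhao, Peng Chen, Dunwu Qi, Rong Hou, Zhihe Zhang

AI summary: This paper proposes an audio-based automatic method for predicting the mating success of giant pandas. It extracts acoustic features and classifies them with a deep neural network to support giant panda breeding research.

Comments: The manuscript needs further revision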

Abstract

Giant pandas, stereotyped as silent animals, make significantly more vocal sounds during breeding season, suggesting that sounds are essential for coordinating their reproduction and expression of mating preference. Previous biological studies have also proven that giant panda sounds are correlated with mating results and reproduction. This paper makes the first attempt to devise an automatic method for predicting mating success of giant pandas based on their vocal sounds. Given an audio sequence of mating giant pandas recorded during breeding encounters, we first crop out the segments containing giant panda vocalizations and normalize their magnitude and length. We then extract acoustic features from each audio segment and feed the features into a deep neural network, which classifies the mating as success or failure. The proposed deep neural network employs convolution layers followed by bidirectional gated recurrent units to extract vocal features, and applies an attention mechanism to force the network to focus on the most relevant features. Evaluation experiments on a data set collected during the past nine years obtain promising results, proving the potential of audio-based automatic mating success prediction methods in assisting giant panda reproduction.
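
The preprocessing step described above, normalizing each cropped segment in magnitude and length, could look like the following minimal sketch. The abstract does not say how the normalization is done, so peak scaling and zero-padding or truncation to a fixed number of samples are assumptions, as is the function name.

```python
def normalize_segment(samples, target_length):
    """Scale a segment to unit peak amplitude, then pad with zeros or truncate."""
    peak = max((abs(sample) for sample in samples), default=0.0)
    # Leave an all-zero (or empty) segment unscaled to avoid dividing by zero.
    scaled = [sample / peak for sample in samples] if peak > 0 else list(samples)
    trimmed = scaled[:target_length]
    return trimmed + [0.0] * (target_length - len(trimmed))


print(normalize_segment([0.5, -0.25], 4))      # prints [1.0, -0.5, 0.0, 0.0]
print(normalize_segment([2.0, 1.0, -2.0], 2))  # prints [1.0, 0.5]
```

Fixing both the amplitude range and the length gives every segment the same input shape, which is what a convolutional front end followed by recurrent layers would typically expect when segments are batched together.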