arXivDaily: arXiv daily academic digest, updated Monday through Friday
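2605.22732 2026-05-22 cs.AI cs.CL cs.HC cs.SD eess.AS (updated version)

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich

Affiliations: Democracy Intelligence gGmbH

AI summary: This paper examines whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak as a case study, it compares three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values come from a post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM that analyses the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations show that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). A systematic evaluation of the Berlin Database of Emotional Speech (EMO-DB) with Gemini in an open-ended annotation paradigm further indicates that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. The results suggest that LLM-based multimodal analysis captures semantically defined political emotion better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend the approach to video analysis incorporating facial expression and gaze.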

Comments 13 pages, 1 figure

Abstract

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
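
The Spearman rank correlation used in the comparison above can be illustrated with a minimal sketch in plain Python. The per-segment scores are made-up placeholders, not values from the paper, and the helper names (`rank`, `pearson`, `spearman_rho`) are hypothetical. Tied values receive average ranks, and constant inputs are not handled.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(rank(x), rank(y))


# Placeholder per-segment scores (assumed, for illustration only).
llm_valence = [0.2, -0.4, 0.7, 0.1, -0.1]
pathos_score = [0.3, -0.2, 0.9, 0.0, 0.1]
print(round(spearman_rho(llm_valence, pathos_score), 3))
```

Because only ranks enter the computation, the statistic is insensitive to the different scales on which the three modalities report their scores.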

2605.22717 2026-05-22 cs.SD cs.AI cs.LG cs.MM (updated version)

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

Affiliations: UC San Diego; MIT; Adobe

AI summary: This paper investigates whether audio diffusion models can be efficiently repurposed into interactive models that run on consumer hardware. The proposed Live Music Diffusion Models (LMDMs) use block-wise KV caching to recover, and then outperform, the inference complexity of discrete Live Music Models (LMMs), and the ARC-Forcing paradigm enables stable post-training alignment that reduces error accumulation without explicit RL or reward models.

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

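2605.22262 2026-05-22 cs.SD cs.LG eess.AS (updated version)

Automatic Contextual Audio Denoising

Diep Luong, Konstantinos Drossos, Mikko Heikkinen, Tuomas Virtanen

Affiliations: Tampere University; Nokia

AI summary: This paper proposes automatic contextual audio denoising, which infers the acoustic scene class to distinguish relevant from irrelevant sound components and thereby improves denoising.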

Abstract

Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise for a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

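2605.22120 2026-05-22 eess.AS cs.SD (updated version)

Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

Zhiqi Ai, Han Cheng, Shiyi Mu, Xinnuo Li, Yongjin Zhou, Shugong Xu

Affiliations: New York University; Xi’an Jiaotong-Liverpool University

AI summary: This paper proposes the DMA-KWS framework, which uses dual-stage matching, multi-modal enrollment, and continual adaptation to address three problems in user-defined keyword spotting: discriminating confusable words, inconsistent pronunciation across speakers, and high data cost. Experiments report 97.85% AUC and 6.13% EER on the LibriPhrase Hard subset, a state-of-the-art result.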

Comments 14 pages, 13 figures, 12 tables. Accepted by TASLP

Abstract

User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data cost to ensure reliable wake-word performance. In this paper, we introduce DMA-KWS, an efficient and robust framework for user-defined keyword spotting. First, it adopts a dual-stage matching pipeline: CTC decoding with streaming phoneme search to locate candidate segments, followed by QbyT with a phoneme matcher for fine-grained verification, enabling it to better distinguish confusable words. Next, multi-modal enrollment fuses user-specific speech with text embeddings to further improve accuracy for registered users. Finally, a parameter-efficient continual adaptation mechanism performs lightweight updates using synthetic and real data. Extensive experiments demonstrate the superior performance of DMA-KWS. On the LibriPhrase Hard subset, it achieves 97.85% AUC and 6.13% EER, reaching state-of-the-art performance. In speaker-dependent settings, DMA-KWS consistently outperforms text-only enrollment, demonstrating significant performance gains. Moreover, the proposed parameter-efficient fine-tuning mechanism adapts DMA-KWS with only 187k updated parameters, further enhancing KWS performance while ensuring suitability for on-device deployment.
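
The equal error rate (EER) quoted above is the operating point where the false-accept rate equals the false-reject rate. The following is a minimal sketch of one common way to approximate it from verification scores, not the paper's evaluation code. The function name and the toy scores are assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely the keyword"; labels: 1 = keyword, 0 = not.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = None, None
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in neg) / len(neg)  # false accepts
        frr = sum(s < thr for s in pos) / len(pos)   # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (assumed): one confusable negative outscores one positive.
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

With finite data the two error curves rarely cross exactly, which is why the sketch averages them at the closest threshold; interpolating between thresholds is another common choice.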

2605.21143 2026-05-22 cs.SD cs.LG (updated version)

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Alexander Gebhard, Andreas Triantafyllopoulos, Dominik Arend, Sandra Müller, Svenja Schmidt, Michael Scherer-Lorenzen, Björn W. Schuller

Affiliations: TUM University Hospital, CHI (Chair of Health Informatics), Ismaninger Str. 22, 81675 Munich, Bavaria, Germany; University of Freiburg, Faculty of Biology, Geobotany, Schaenzlestr. 1, 79104 Freiburg, Baden-Württemberg, Germany; MCML (Munich Center for Machine Learning), Munich, Bavaria, Germany; Imperial College London, GLAM (Group on Language, Audio, & Music), London, UK

AI summary: This paper proposes CoarseSoundNet, a model that classifies biophony, geophony, and anthropophony under realistic noisy conditions, and systematically studies model architectures, training data, and evaluation strategies to improve generalisation in passive acoustic monitoring.

Comments Currently under review

Abstract

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.
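
The class-specific decision thresholds and duration-based constraints mentioned above can be sketched as a post-processing step over per-frame class probabilities. This is an illustrative sketch only. The threshold and minimum-duration values and the function name are assumptions, not settings from the paper.

```python
def frames_to_events(probs, threshold, min_frames):
    """Binarise per-frame probabilities and keep runs of at least min_frames.

    Returns a list of (start, end) frame indices with end exclusive.
    """
    events, start = [], None
    for i in range(len(probs) + 1):
        # Treat the position past the last frame as inactive to close a trailing run.
        active = i < len(probs) and probs[i] >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            if i - start >= min_frames:
                events.append((start, i))
            start = None
    return events


# Hypothetical per-class settings: (threshold, minimum duration in frames).
SETTINGS = {"biophony": (0.5, 1), "geophony": (0.6, 3), "anthropophony": (0.4, 2)}

frame_probs = {
    "biophony": [0.7, 0.2, 0.8, 0.9],
    "geophony": [0.7, 0.7, 0.1, 0.9],
    "anthropophony": [0.1, 0.5, 0.6, 0.2],
}
for name, probs in frame_probs.items():
    thr, min_len = SETTINGS[name]
    print(name, frames_to_events(probs, thr, min_len))
```

Tuning the two parameters per class lets a detector be stricter for classes that tend to produce short spurious activations while staying sensitive for brief but genuine events.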

2605.20578 2026-05-22 cs.SD cs.CV (updated version)

A strongly annotated passive acoustic dataset for tropical bird monitoring

Daniela Ruiz, Juan Sebastián Ulloa, Zhongqi Miao, Nicolás Betancourt, Maria Paula Toro-Gómez, Andrés Hernández, Bruno Demuro, Eliana Barona-Cortés, Angela Mendoza-Henao, Andrés Sierra-Ricaurte, Sebastián Pérez-Peña, Rahul Dodhia, Pablo Arbeláez, Juan M. Lavista Ferres

Affiliations: Microsoft AI for Good Research Lab; Instituto de Investigación de Recursos Biológicos Alexander von Humboldt; Center for Research and Formation in Artificial Intelligence; Fundación Manacus; Louisiana State University; Museum of Natural Sciences

AI summary: This paper presents PteroSet, a dataset for tropical bird monitoring whose strongly annotated audio is released in a COCO-inspired JSON format as a machine learning benchmark, together with a deep learning baseline for binary bird detection.

Abstract

Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.

2605.03934 2026-05-22 cs.SD cs.AI (updated version)

Towards Open World Sound Event Detection

P. H. Hai, L. T. Minh, L. H. Son

Affiliations: VNU University of Engineering and Technology; Artificial Intelligence Research Center, VNU Information Technology Institute

AI summary: This paper proposes the Open-World Sound Event Detection (OW-SED) paradigm and addresses overlapping and ambiguous events with a deformable architecture and the novel WOOT framework, improving detection performance in open-world settings.

Comments 32 pages, 3 figures. Accepted to Signal Processing (Elsevier)

Journal ref: Signal Processing, Article 110707, 2026

Abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

2601.18094 2026-05-22 eess.AS cs.SD (updated version)

OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion

Zhichao Wang, Tao Li, Wenshuo Ge, Zihao Cui, Shilei Zhang, Junlan Feng

Affiliations: JIUTIAN Research; China Mobile; Beijing, China

AI summary: This paper proposes OneVoice, a zero-shot framework that handles the three voice conversion scenarios (linguistic-preserving, expressive, and singing) in a single model. It unifies them through a Mixture-of-Experts design with dual-path routing and uses a two-stage training strategy to address data imbalance.

Abstract

Recent progress in voice conversion (VC) has achieved a new milestone in speaker cloning and linguistic preservation. But the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gated mechanism, allowing adaptive usage of prosody information. Furthermore, to enable the core idea and alleviate the data imbalance (abundant speech vs. scarce singing), we adopt a two-stage progressive training scheme that includes foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while verifying flexible control over scenarios and offering a fast decoding version that needs as few as 2 steps. Audio samples are available on the demo page.

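2509.15151 2026-05-22 cs.SD cs.AI (updated version)

Exploring How Audio Effects Alter Emotion with Foundation Models

Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

AI summary: This paper uses foundation models to study how audio effects influence emotion, examining what such models reveal about the relationship between audio effects and estimated emotion and uncovering patterns in the perceptual impact of sound design techniques.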

Comments https://github.com/stelioskt/audioFX

Abstract

Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models - large-scale neural architectures pretrained on multimodal data - can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from deep learning models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. Our findings aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.

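2410.18151 2026-05-22 cs.SD cs.LG cs.MM eess.AS (updated version)

Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Weiliang Luo

Affiliations: Massachusetts Institute of Technology

AI summary: This paper proposes Music102, an equivariant transformer grounded in group theory and musical structure that improves chord progression accompaniment. By building musical symmetries such as transposition and reflection into the architecture, it outperforms the non-equivariant Music101 prototype.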

Comments 10 pages, 3 figures

Journal ref: Proceedings of the 2025 International Computer Music Conference (https://hdl.handle.net/2027/fulcrum.zg64tq53m)

Abstract

We present Music102, an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry--such as transposition and reflection operations--integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over the non-equivariant Music101 prototype in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.
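
To make the equivariance property concrete, here is a minimal sketch of transpositions and reflections acting on the 12 pitch classes, with a brute-force check of whether a map commutes with every group element. It assumes the group consists of 12 transpositions and 12 reflections (24 elements). It does not reproduce the Music102 architecture, and all names are hypothetical.

```python
# A group element is (k, flip): optionally reflect p -> -p, then transpose by k.
GROUP = [(k, flip) for k in range(12) for flip in (False, True)]


def act(g, pitch_classes):
    """Apply a transposition/reflection to a sequence of pitch classes (0-11)."""
    k, flip = g
    return [((-p if flip else p) + k) % 12 for p in pitch_classes]


def is_equivariant(f, pitch_classes):
    """Check f(g . x) == g . f(x) for every group element g."""
    return all(f(act(g, pitch_classes)) == act(g, f(pitch_classes)) for g in GROUP)


def add_tritone(pitch_classes):
    """Toy map: shift every note by a tritone (6 semitones)."""
    return [(p + 6) % 12 for p in pitch_classes]


def add_fifth(pitch_classes):
    """Toy map: shift every note by a perfect fifth (7 semitones)."""
    return [(p + 7) % 12 for p in pitch_classes]


melody = [0, 4, 7]  # C, E, G as pitch classes
print(is_equivariant(add_tritone, melody))  # commutes with reflections too
print(is_equivariant(add_fifth, melody))    # breaks under reflection
```

The fifth-shift commutes with transpositions but not with reflections, because reflection turns an upward fifth into a downward one; the tritone survives because 6 and -6 coincide modulo 12.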

2605.22083 2026-05-22 cs.SD cs.LG eess.AS (updated version)

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

Affiliations: Supertone Inc; Independent Researcher

AI summary: This paper proposes RobustSpeechFlow, a training strategy that improves alignment robustness by introducing length-preserving repeat and skip latent augmentations. It needs no external aligners or preference data, directly penalizes realistic failure modes, and integrates readily into existing pipelines. Experiments show improved content fidelity and robustness in text-to-speech.

Comments Submitted to INTERSPEECH 2026

Abstract

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
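
WER and CER, the metrics reported above, are both edit distances normalised by reference length, computed over words and characters respectively. The sketch below is a generic illustration rather than the benchmark's scoring script. The function names and the example sentences are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# A transcript with one repeated word and one skipped word (made-up example).
print(wer("the cat sat on the mat", "the cat cat on mat"))
```

A repeat shows up as a substitution or insertion and a skip as a deletion, so both failure modes named in the abstract raise these scores directly.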

2605.21538 2026-05-22 cs.SD (updated version)

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang, Hung-yi Lee, Hao-Wen Dong, Yi-Hsuan Yang

Affiliations: Artificial Intelligence Center of Research Excellence, National Taiwan University, Taipei, Taiwan; Department of Performing Arts Technology, University of Michigan, Ann Arbor, MI, United States

Comments Accepted to IEEE ICME 2026 Grand Challenge Paper

Abstract

This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Frechet Audio Distance, CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.

2605.19955 2026-05-22 cs.CR cs.SD (updated version)

DASM: Domain-Aware Sharpness Minimization for Multi-Domain Voice Stream Steganalysis

Pengcheng Zhou, Pianran Guo, Shuhua Chen, Mengqin Zhao, Zhongliang Yang, Linna Zhou

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; School of Cyberspace Security, Beijing University of Posts and Telecommunications; College of Communication Engineering, Jilin University

AI summary: This paper proposes DASM, a domain-aware sharpness minimization method that combines domain-supervised contrastive learning with sharpness-aware optimization to improve the robustness and generalization of multi-domain voice stream steganalysis.

Abstract

The growing use of information hiding in network streaming media for covert communication poses a significant security threat, necessitating the development of robust detection technologies. However, existing steganalysis methods for network voice streams mostly rely on data distributions in specific scenarios, making it difficult to adapt to the practical detection needs of non-homologous data distributions. Through Hessian analysis, we find that the loss landscapes of mainstream models are dominated by numerous saddle points and sharp local minima, rendering them highly sensitive to data distribution shifts and fundamentally limiting generalization. Therefore, we propose a new optimizer, Domain-Aware Sharpness Minimization (DASM). The core mechanisms of DASM consist of two aspects: first, it integrates domain-supervised contrastive learning with sharpness-aware optimization, explicitly preserving inter-domain feature separation while seeking flat minima; second, we design an adaptive domain gap modulation strategy that dynamically calibrates the optimization loss weights by sensing the real-time feature separability of different domains. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods by a large margin and achieves excellent generalization and robustness.

2605.16304 2026-05-22 eess.SP cs.SD (updated version)

Modulation Feature Enhancement with a Multi-Stage Attention Network for Underwater Acoustic Target Recognition

Jiaping Yu, Shefeng Yan, Linlin Mao, Zeping Sui, Chunjin Jiang

Affiliations: Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Computer Science and Electronics Engineering, University of Essex

AI summary: This paper proposes a feature extraction and fusion method based on variational mode decomposition and the 3/2-D spectrum, combined with a multi-stage multi-type attention mechanism and an adjustable class-balanced focal loss, to improve underwater acoustic target recognition.

Comments 31 pages, 14 figures, Accepted by Signal Processing

Abstract

Underwater acoustic target recognition is critical for maritime applications, yet it faces challenges arising from the complex and diverse nature of ship-radiated noise. To address these issues, we propose a robust deep learning-based framework. First, we introduce a feature extraction and fusion method based on variational mode decomposition (VMD) and the 3/2-D spectrum to generate high-fidelity 2-D DEMON spectral features, which effectively capture modulation envelope information. To further enhance feature representation, we design a one-dimensional convolutional neural network (1-D CNN) integrated with a novel Multi-Stage Multi-Type Attention Mechanism (MMATT) that adaptively refines features at different network depths. Within this mechanism, we propose a Residual Channel-Independent Spectral Attention Mechanism (R-CISAM) and a Multi-Scale Separate-and-Fuse Spectral Attention Mechanism (MS-SFSAM). Moreover, to mitigate performance degradation caused by severe class imbalance inherent in real-world ship-radiated noise data, we devise an Adjustable Class-Balanced Focal Loss (ACBFL), which provides flexibility across tasks with varying degrees of imbalance. Experimental results on a real-world ship-radiated noise dataset demonstrate that the proposed solutions effectively enhance underwater acoustic target recognition performance.
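
The abstract does not spell out the adjustable form of ACBFL, so the sketch below shows only the widely used class-balanced focal loss it appears to build on: class weights from the "effective number of samples" combined with the focal modulating factor. This is an assumption about the starting point, not the paper's loss. The function names, `beta`, `gamma`, and the class counts are illustrative.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Weights proportional to (1 - beta) / (1 - beta**n_c), normalised to sum to the class count."""
    raw = [(1 - beta) / (1 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


def cb_focal_loss(probs, target, weights, gamma=2.0):
    """Class-balanced focal loss for one sample.

    probs: predicted class probabilities; target: index of the true class.
    The factor (1 - p_t)**gamma down-weights examples that are already easy.
    """
    p_t = probs[target]
    return -weights[target] * (1 - p_t) ** gamma * math.log(p_t)


# Hypothetical class counts: one frequent and one rare vessel class.
weights = class_balanced_weights([1000, 10])
print(weights)
print(cb_focal_loss([0.3, 0.7], target=1, weights=weights))
```

Setting `gamma` to 0 and all weights to 1 recovers plain cross-entropy, which makes the two knobs easy to reason about separately.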

2511.08093 2026-05-22 eess.AS cs.CL cs.SD (updated version)

Quantizing Whisper-small: How design choices affect ASR performance

Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal

Affiliations: Copenhagen Business School; Danske Bank; Jabra (GN Group)

AI summary: This paper studies how different quantization schemes affect Whisper-small and finds that dynamic int8 quantization gives the best trade-off between model compression and recognition accuracy. Carefully chosen quantization methods substantially reduce model size and inference cost, enabling efficient deployment on constrained hardware.

Comments Accepted to SPEAKABLE workshop at LREC 2026

Abstract

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
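
The arithmetic behind int8 quantization can be sketched with symmetric per-tensor scaling in plain Python. This is a generic illustration of the number format only; it is not how PyTorch, Optimum-Quanto, HQQ, or bitsandbytes implement it, and the function names and weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte plus a shared scale, and the rounding error per weight is bounded by half the scale. In dynamic schemes the activation scales are computed at inference time rather than calibrated in advance.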