arXivDaily: daily arXiv academic digest, updated Monday to Friday
2606.02448 2026-06-02 eess.SP cs.SD version update

Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening


Xinqi Bao, Jia Bi, Xin Chen, Ernest Nlandu Kamavuako, Saikat Chatterjee

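Affiliations: Department of Information Science & Engineering, KTH Royal Institute of Technology; Rutherford Appleton Laboratory; Peng Cheng Laboratory; Department of Engineering, King’s College London

AI summary: Proposes a class-conditional diffusion model in the log-mel domain for generating phonocardiograms. Synthetic fidelity is assessed with physiological metrics, downstream classification accuracy, and expert listening, and the remaining challenges of retaining abnormal acoustic cues and reducing reconstruction artefacts are analysed.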


Abstract

Publicly available phonocardiogram (PCG) datasets remain limited in size and pathological diversity, constraining both auscultation training and the generalisation of automated heart-sound classifiers. A class-conditional diffusion model for PCG generation is developed in the log-mel domain and synthetic fidelity is assessed using complementary (i) physiology-inspired plausibility metrics, (ii) downstream label-consistency evaluation, and (iii) expert listening. Experiments use the PhysioNet/Computing in Cardiology Challenge 2016 dataset (3240 recordings) with recording-level splits. After preprocessing and quality control, 16,749 non-overlapping 4 s clips are mapped to a normalised 1 × 128 × 128 log-mel representation to train a conditional 2D U-Net denoiser with classifier-free guidance. Signal-level plausibility is quantified on reconstructed waveforms using three lightweight metrics: an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag. Synthetic clips preserve similar dominant cycle durations but exhibit reduced envelope periodicity and increased transient burstiness relative to real clips. For downstream evaluation, a ResNet-50 classifier achieves 92.24% accuracy on the held-out real test set and 82.8% accuracy on class-balanced synthetic batches, indicating that generated signals retain discriminative structure relevant to normal/abnormal classification. In a pilot expert listening study (60 clips, two clinicians), most synthetic clips are judged as heart-sound-like, while abnormality sensitivity is low for both real and synthetic 4 s excerpts. Overall, the results provide a practical baseline for diffusion-based PCG generation while highlighting remaining challenges in retaining abnormal acoustic cues and reducing reconstruction-induced artefacts.
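The abstract names an envelope-autocorrelation rhythm score and a dominant cycle lag but does not define them. The sketch below shows one plausible way to compute such quantities with NumPy. The moving-average envelope, the 0.3 to 2.0 s search range, and the function names `envelope` and `rhythm_score_and_lag` are all assumptions made for illustration, not the authors' definitions.

```python
import numpy as np

def envelope(x, sr, win_s=0.05):
    """Moving-average envelope of the rectified signal (an assumed, simple choice)."""
    win = max(1, int(win_s * sr))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(x), kernel, mode="same")

def rhythm_score_and_lag(x, sr, min_s=0.3, max_s=2.0):
    """Return (peak normalised autocorrelation, lag in seconds).

    The peak is searched only between min_s and max_s, an assumed range
    of plausible cardiac-cycle durations.
    """
    env = envelope(x, sr)
    env = env - env.mean()
    # One-sided autocorrelation, normalised so that lag 0 equals 1.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    ac = ac / ac[0]
    lo, hi = int(min_s * sr), min(int(max_s * sr), len(ac) - 1)
    k = lo + int(np.argmax(ac[lo:hi + 1]))
    return float(ac[k]), k / sr
```

Under this reading, a strongly periodic clip yields a score close to 1 and a lag close to its beat period, while a clip with irregular bursts yields a lower score.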

2606.02212 2026-06-02 cs.SD version update

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification


Ziqi Ma, Mengyu Han, Anteng Cai, Zhanchong Liu, Bowen Feng, Hang Yu, Sheng Hu

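Affiliations: School of Computer Engineering and Science, Shanghai University; School of AI and Advanced Computing (AIAC), XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University; ISIR, Osaka University

AI summary: To address limited data, heavy noise, and class imbalance in respiratory sound classification, proposes C2GA, a class-controllable generative augmentation framework built on a conditional VQ-VAE and a Transformer autoregressive prior, enabling high-fidelity, semantically consistent sample generation.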

Comments 18 pages, 5 figures, submitted to Computer Methods and Programs in Biomedicine


Abstract

Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
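The discrete latent space mentioned above rests on vector quantisation: each encoder output is replaced by its nearest codebook entry, and the entry's index is the token. The sketch below shows that standard lookup with NumPy. The additive `fuse_with_prototype` step is purely hypothetical, since the abstract does not say how tokens and class prototypes are fused.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector (T, D) to its nearest codebook entry (K, D).

    Returns the token indices (T,) and the quantised vectors (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

def fuse_with_prototype(quantized, prototype):
    """Hypothetical fusion: add one class prototype (D,) to every token embedding."""
    return quantized + prototype[None, :]
```

An autoregressive prior would then be trained on the `tokens` sequences, conditioned on the class label.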

2606.02127 2026-06-02 eess.AS cs.SD version update

Localizing broadband noise sources using the Loève spectrum and a 2.5D approach


Christian H. Kasess, Wolfgang Kreuzer, Holger Waubke

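Affiliations: OeAW (Austrian Academy of Sciences)

AI summary: For localizing moving broadband stochastic sound sources, proposes an inverse localization method based on a 2.5D setting and the Loève spectrum. Derives the relation between the power spectral density of the moving source and the Loève spectrum at static receivers, and localizes sources using multi-taper estimates.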

Comments 31 pages, 13 figures


Abstract

The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain this compensation is done on a sample-by-sample basis. In the frequency domain short time segments need to be used in which the Doppler effect is assumed to be approximately constant, and a discrete Fourier transform is done on each segment. In contrast, the authors developed an inverse 2.5D localization method for uniformly moving single-frequency sources that works in the spectral domain and allows for the use of longer windows. This was achieved by modifying the 2.5D forward model to directly compute the effect of the motion at the static observer position. The method requires neither modification of the measured signal nor quasi-stationarity of the measurements within the window used. Unfortunately, this approach is not directly suitable for broad-band stochastic sources, and in the present work we investigate how the statistical properties of a uniformly moving stochastic source change when observed at a static observer. Using a 2.5D setting, the relation between the power spectral density of the moving source and the Loève spectrum at the static receivers, which is a generalization of the cross-spectral density, was derived. Based on simulated data with speeds up to 100 m/s, the work presented here provides a proof of concept for a method based on multi-taper estimates of the Loève spectrum to localize moving broad-band stochastic sources. Currently, the method requires a stationary source signal and a spectral density that is flat within a certain range around the frequency of interest. Also, correlations between sources are currently not considered.
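For a single channel, the Loève spectrum can be written as the expectation E[X(f1) X*(f2)] over pairs of frequencies, which for a stationary signal concentrates on the diagonal f1 = f2. The sketch below estimates it by averaging over Hann-windowed segments. This is a simpler stand-in for the multi-taper estimator the abstract refers to, and `loeve_spectrum` is a name chosen here for illustration.

```python
import numpy as np

def loeve_spectrum(x, seg_len):
    """Estimate E[X(f1) conj(X(f2))] from non-overlapping windowed segments.

    Segment averaging replaces the multi-taper estimate used in the paper
    (an assumption made to keep the sketch NumPy-only).
    """
    n_seg = len(x) // seg_len
    segs = x[: n_seg * seg_len].reshape(n_seg, seg_len) * np.hanning(seg_len)
    X = np.fft.rfft(segs, axis=1)        # (n_seg, n_freq)
    # Entry (i, j) averages X[:, i] * conj(X[:, j]) over segments.
    return X.T @ X.conj() / n_seg        # (n_freq, n_freq)
```

For stationary white noise the estimate is dominated by its diagonal. Off-diagonal structure is the kind of signature a moving source would leave at a static receiver.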

2605.03384 2026-06-02 cs.CR cs.SD version update

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition


Bikrant Bikram Pratap Maurya, Nitin Choudhury, Daksh Agarwal, Arun Balaji Buduru

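Affiliations: IIIT-Delhi; Guru Gobind Singh Indraprastha University

AI summary: To address the generalization of keyboard acoustic side-channel attacks across keyboards, users, and noisy environments, proposes DECKER, a four-stage domain-invariant keystroke inference framework, and builds the multi-dimensional HEAR dataset. Experiments show that the method significantly improves keystroke identification in cross-keyboard and cross-user settings.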

Comments Accepted to AsiaCCS'26


Abstract

Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.

2606.01909 2026-06-02 cs.SD cs.AI eess.AS version update

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space


Louis Mouchon

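Affiliations: Louis Mouchon

AI summary: Proposes Echo, a system built on a single 25M-parameter ViT encoder that, through JEPA pretraining and staged specialisation, jointly performs speaker diarization, speech separation, and speech recognition in a 512-dimensional latent space, with no fine-tuning at deployment.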

Comments 18 pages, 17 tables, 1 figure. Proof-of-concept, independent research


Abstract

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
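The separation result above is reported as SI-SDR (scale-invariant signal-to-distortion ratio). The sketch below implements the standard definition for two 1-D vectors. The abstract's "latent SI-SDR" is presumably computed on latent vectors rather than waveforms, but how it is applied there is not stated, so this shows only the generic formula.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB between two 1-D arrays of equal length."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the remainder counts as distortion.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10 * np.log10((np.dot(projection, projection) + eps)
                         / (np.dot(noise, noise) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the measure scale-invariant.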

2606.01905 2026-06-02 eess.AS cs.SD version update

Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning


Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, Tomoki Toda

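Affiliations: Graduate School of Informatics, Nagoya University; School of Biological Science and Medical Engineering, Beihang University; TARVO, Inc.; Information Technology Center, Nagoya University

AI summary: Proposes a learning framework that fuses speech and text representations to improve the mapping from electrolaryngeal to normal speech and the reconstruction quality of a sequence-to-sequence voice conversion model. Experiments show it outperforms methods that rely on speech representations alone.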

Comments 15 pages, 7 figures. Accepted to IEEE TBME

Journal ref
IEEE Transactions on Biomedical Engineering, Early Access, 2026

Abstract

Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: Our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentation, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV version update

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions


Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

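Affiliations: Jen Music AI

AI summary: Proposes the JenBridge framework, which combines a Transformer-based generative model, dual text-visual conditioning for alignment, and an adaptive transition mechanism driven by an LLM agent to generate high-fidelity soundtracks for long videos with natural, coherent scene transitions.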


Abstract

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.
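The abstract says the generator is trained with a flow-matching objective without giving its form. A common variant regresses a velocity field onto the constant velocity of a straight path from noise to data. The sketch below computes that loss with NumPy. The linear path, the function name `flow_matching_loss`, and the `velocity_fn(x_t, t)` signature are assumptions, and conditioning on text and video is left out.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """MSE between a predicted velocity and the straight-line target x1 - x0.

    x0: noise samples (B, D); x1: data samples (B, D); t: times in [0, 1], shape (B,).
    velocity_fn(x_t, t) stands in for the conditioned model (hypothetical signature).
    """
    t_col = t[:, None]
    x_t = (1.0 - t_col) * x0 + t_col * x1   # point on the linear path at time t
    target = x1 - x0                        # constant velocity along that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

At sampling time, a model trained this way is integrated from noise at t = 0 to a sample at t = 1.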

2606.01686 2026-06-02 cs.SD cs.AI version update

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark


Seonghyeon Go, Yumin Kim

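Affiliations: KAIST

AI summary: To address the restriction of current AI music detection to binary classification, proposes the HAIM dataset, defines the "AI Music Tracking" task through multi-stage labels, evaluates the shortcomings of existing detectors, and pushes the field toward fine-grained, structured evaluation.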


Abstract

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

2606.01677 2026-06-02 cs.SD version update

UniVocal: Unified Speech-Singing Code-Switching Synthesis


Yufei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai

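Affiliations: Tongyi Fun Team, Alibaba Group; Independent Researcher

AI summary: Proposes UniVocal, a unified framework that uses two-stage curriculum learning and chain-of-thought generation to infer vocal modes implicitly from text context, enabling speech-singing code-switching synthesis and achieving the best performance on SCSBench.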

Comments accepted by ACL 2026


Abstract

We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.

2606.01578 2026-06-02 eess.AS cs.SD version update

Description and Discussion on DCASE 2026 Challenge Task 2: Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring


Tomoya Nishida, Noboru Harada, Daiki Takeuchi, Daisuke Niizumi, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, Yohei Kawaguchi

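Affiliations: National Institute of Information and Communications Technology, Japan

AI summary: Introduces DCASE 2026 Challenge Task 2, which uses two-channel audio captured near and far from the machine to separate environmental noise from machine sound and improve the robustness of unsupervised anomalous sound detection under noisy conditions.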

Comments this article draws heavily from arXiv:2506.10097


Abstract

This paper presents an overview of DCASE 2026 Challenge Task 2, titled "Noise-aware unsupervised anomalous sound detection (UASD) for machine condition monitoring." The task aims to advance noise-robust anomalous sound detection for machine condition monitoring under the unsupervised setting, where only normal machine sounds are available for training. Reliable detection under noisy conditions is crucial for practical deployment, but previous DCASE Task 2 settings provided limited information about environmental noise, potentially limiting UASD performance in highly noisy situations. To address this limitation, DCASE 2026 allows participants to exploit two-channel audio samples simultaneously captured at locations near and far from the target machine. Since the distant microphone is expected to contain relatively stronger environmental noise and weaker direct machine sounds, it may help distinguish environmental noise components from the target machine sounds. After the challenge submission deadline, challenge results and an analysis of the submitted systems will be added.

2606.01460 2026-06-02 cs.SD eess.AS version update

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation


Michael Taenzer

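Affiliations: University of California, Berkeley

AI summary: Proposes a lightweight slot-attention framework for multi-instrument multi-pitch estimation using Hungarian matching and modular extensions, and validates its instrument-family decomposition on URMP.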

Comments Preprint submitted to the IEEE 28th International Workshop on Multimedia Signal Processing (MMSP). This work has been submitted to the IEEE for possible publication. 6 pages, 2 figures


Abstract

Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.
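Permutation-invariant matching pairs each reference source with the predicted slot that minimises the total cost, so that slot order carries no fixed meaning. The sketch below finds that pairing by brute force over permutations, which gives the same optimum as the Hungarian algorithm but is only practical for a handful of slots. The mean-squared-error cost and the name `best_assignment` are assumptions for illustration.

```python
import itertools
import numpy as np

def best_assignment(pred, target):
    """Match each target pitch map to a distinct predicted slot at minimum cost.

    pred: (n_slots, ...) array; target: (n_sources, ...) array, n_sources <= n_slots.
    Returns (perm, cost), where perm[i] is the slot matched to target i.
    """
    n = len(target)
    # cost[i, j]: mean squared error between target i and predicted slot j.
    cost = np.array([[np.mean((p - t) ** 2) for p in pred] for t in target])
    best = min(itertools.permutations(range(len(pred)), n),
               key=lambda perm: sum(cost[i, perm[i]] for i in range(n)))
    return best, float(sum(cost[i, best[i]] for i in range(n)))
```

Allowing more slots than sources mirrors the abstract's treatment of the slot count as an upper bound. Unmatched slots are left out of the assignment.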

2606.01264 2026-06-02 q-bio.NC cs.HC cs.SD eess.AS eess.SP version update

A 1000-hour EEG-EMG-audio dataset of Japanese speech production


Motoshige Sato, Ilya Horiguchi, Masakazu Inoue, Kenichi Tomeoka, Eri Hatakeyama, Yuya Kita, Atsushi Yamamoto, Ippei Fujisawa, Shuntaro Sasai

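Affiliations: National Institute of Information and Communications Technology, Japan

AI summary: Builds a multimodal dataset of 1020 hours of synchronized scalp EEG, facial EMG, and speech audio recorded from three healthy native Japanese speakers during open-vocabulary overt speech, intended to support speech decoding, multimodal signal processing, and EEG representation learning.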


Abstract

We present a multimodal dataset of 1020 hours of simultaneously recorded scalp electroencephalography (EEG), facial electromyography (EMG), and speech audio from three healthy native Japanese speakers during open-vocabulary overt speech. Recordings were acquired with three EEG systems (an ultra-high-density system, g.Pangolin, and two cap-type systems, g.SCARABEO and eegosports, spanning 62 to 128 channels) across many sessions over several months. Each session provides time-synchronized EEG, facial EMG, and audio, together with speech-event annotations and transcriptions. Although collected with speech decoding as a primary motivation, the dataset also supports work on multimodal signal processing, artifact modeling, longitudinal and cross-device adaptation, and EEG representation learning. Technical validation included power spectral density and event-related potential analyses across participants, devices, and tasks, which showed the expected 1/f spectral profile, task-related alpha-band attenuation, and time-locked evoked responses. The dataset is released in Brain Imaging Data Structure (BIDS) format via OpenNeuro under a CC0 waiver to support both speech-related and broader EEG research.

2606.01135 2026-06-02 cs.NE cs.SD version update

Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition


Tauseef Ahmed, Tao Sun, Jeronimo Castrillon, Kanishkan Vadivel, Guangzhi Tang

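Affiliations: University of California, Berkeley; Tsinghua University

AI summary: Proposes spiking and event-driven neuromorphic Mamba models that raise speech recognition efficiency through activation sparsity, achieving over 60% activation sparsity on LibriSpeech with less than 1% accuracy loss, and develops a cycle-accurate event-driven simulator for algorithm-hardware co-optimization.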

Comments Accepted at IJCNN2026


Abstract

Deep learning has greatly advanced automatic speech recognition (ASR), enabling widespread deployment on edge devices such as smartphones and smart home systems. However, the computational and energy demands of deep neural networks pose significant challenges for such resource-constrained deployments, introducing latency and limiting real-time interaction. Neuromorphic computing offers a promising solution by introducing activation sparsity through spiking neural networks (SNNs) and event-driven neural networks, converting dense operations into sparse computations. However, a study that evaluates the hardware benefits of different neuromorphic strategies remains lacking for ASR. This paper explores spiking and event-driven neuromorphic neural networks to improve activation sparsity in the state-of-the-art SpeechMamba model for ASR. We introduce an event-driven SpeechMamba with FATReLU activation, achieving over 60% activation sparsity with less than 1% accuracy degradation on LibriSpeech. We also propose a spiking SpeechMamba that attains over 70% sparsity while using 30% fewer parameters than comparable SNNs. Finally, we develop a cycle-accurate event-driven simulator enabling flexible algorithm-hardware co-exploration, which helps us identify computational bottlenecks and yields over 10% additional efficiency improvements.
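Activation sparsity, the quantity behind the 60% and 70% figures above, is the fraction of activations that are exactly zero and can therefore be skipped by event-driven hardware. The sketch below assumes FATReLU behaves as a ReLU with a positive cut-off, which is how the term is generally used. The threshold value and the way it is chosen in the paper are not given in the abstract.

```python
import numpy as np

def fatrelu(x, threshold):
    """Thresholded ReLU: keep values at or above `threshold`, zero the rest."""
    return np.where(x >= threshold, x, 0.0)

def activation_sparsity(a):
    """Fraction of activations that are exactly zero."""
    return float(np.mean(a == 0))
```

Raising the threshold zeroes small positive activations that a plain ReLU would pass, trading a little accuracy for fewer events to process.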

2606.01134 2026-06-02 eess.AS cs.LG cs.SD version update

Context-aware child-directed speech detection from long-form recordings


Théo Charlot, Tarek Kunze, Kaveri K. Sheth, Alejandrina Cristia, Marvin Lavechin

发表机构 * LSCP, DEC, ENS, EHESS, CNRS, PSL University, France(法国社会科学高等学院(LSCP)、法国国家科学研究中心(DEC)、巴黎高等师范学院(ENS)、高等科学研究院(EHESS)、法国国家科学研究中心(CNRS)、巴黎社会科学大学(PSL University))

AI总结 本研究通过微调自监督模型、融入上下文信息以及端到端流水线评估,显著提升了从长时间录音中自动检测儿童导向语音的性能。

Comments 6 pages, 1 figure

详情
AI中文摘要

在长时间录音中自动区分儿童导向语音和成人导向语音,是可扩展分析儿童语言环境的关键。现有方法孤立地处理话语,并且主要针对英语进行评估。我们从三个维度解决这些不足。首先,我们在一个包含182名儿童的多语言数据集上微调并评估了六个自监督模型,表明在儿童中心录音上进行领域内预训练显著优于在成人语音上训练的模型。其次,我们证明融入周围上下文能大幅提升分类性能,平均F1分数绝对提升13.8%。第三,我们在一个现实的端到端流水线中评估我们的模型,从成人语音检测到受话者分类,显示在自动分割下性能有所下降,但仍持续优于基于规则的基线。

English abstract

Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.

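2606.01009 2026-06-02 cs.SD Updated version

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

MelT: 面向现代加速器的高效单级音频前端的GEMM原生NDFT

Augusto Camargo, Marcelo Finger

Affiliations * Instituto de Ciências Matemáticas e de Computação, University of São Paulo, Brazil

AI summary Proposes MelT, a framework that formulates the Mel-spaced non-uniform discrete Fourier transform (NDFT) as dense general matrix multiplication (GEMM) operations to build a single-stage audio frontend replacing the conventional STFT+Mel pipeline, achieving up to 3.75× faster inference and 3.52× lower energy consumption across several accelerators.

Details
AI Chinese abstract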

现代音频处理网络通常部署在加速器上,其峰值吞吐量通过稠密线性代数获得,而传统的声学前端——短时傅里叶变换(STFT)后接稀疏梅尔聚合——在结构上仍然是异构的。这种不匹配会在当代加速器后端引入内存带宽、调度和中间分配开销。本文介绍MelT,一个单级前端框架,其中梅尔间隔非均匀离散傅里叶变换(NDFT)基被预先计算,并通过稠密通用矩阵乘法(GEMM)操作应用于时域声学帧。贡献不在于NDFT算子本身,而在于将梅尔间隔NDFT投影公式化为GEMM原生的音频前端,并将其评估为传统STFT+梅尔流水线的硬件高效替代方案。在从Apple A18 Pro边缘硬件到NVIDIA H100数据中心加速器的多个平台上评估,MelT在保持下游分类准确性的同时,实现了高达3.75倍的推理延迟加速和3.52倍的能耗降低。

English abstract

Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
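
As a rough illustration of the single-stage idea described above, the sketch below precomputes complex exponentials at Mel-spaced frequencies and applies them to time-domain frames as one dense matrix product. It is not the authors' implementation: the HTK-style Mel formula, the Hann window folded into the basis, the magnitude output, and all function names are assumptions made for this example. Pure-Python loops stand in for the GEMM call that an accelerator library would provide.

```python
import cmath
import math


def hz_to_mel(f):
    """HTK-style Mel scale (an assumption; other Mel variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_freqs(n_bins, f_min, f_max):
    """Analysis frequencies in Hz, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * k / (n_bins - 1)) for k in range(n_bins)]


def ndft_basis(freqs, frame_len, sample_rate):
    """Precompute a (n_bins x frame_len) complex basis with a Hann window folded in."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    return [
        [window[n] * cmath.exp(-2j * math.pi * f * n / sample_rate) for n in range(frame_len)]
        for f in freqs
    ]


def apply_frontend(basis, frames):
    """Dense product frames x basis^T, followed by magnitude; returns [frame][bin]."""
    return [
        [abs(sum(w * x for w, x in zip(row, frame))) for row in basis]
        for frame in frames
    ]
```

Because the basis is fixed, it can be computed once and reused for every frame, so the whole frontend reduces to a single dense multiplication rather than an FFT followed by a sparse Mel filterbank.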

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS Updated version

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia: 具有连续情感调节的情感自适应语音助手

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

Affiliations * Department of Electrical Engineering, Columbia University

AI summary Proposes Sympatheia, a spoken dialogue framework that infers affect from the user's speech and combines it with a continuous valence-arousal control signal to produce emotionally adaptive responses, outperforming baseline models.

Details
AI Chinese abstract

共情口语对话系统必须推断用户的情感状态以做出适当响应,然而日常语音通常带有微弱、中性或模糊的情感线索。为解决这一问题,我们引入了Sympatheia,一种语音到语音对话框架,其条件基于从用户语音中推断出的情感,并且在可用时,基于多模态感知模块或用户界面提供的连续效价-唤醒度(VA)控制信号中的明确情感规格。为了训练我们的模型,我们构建了Sympatheia-18k,一个包含12个情感锚点的情感条件合成口语对话语料库。该数据集包括用于学习情感语音行为的情感分割,以及一个中性分割,该分割将情感中性查询与多个情感条件响应配对,以在情感模糊情况下隔离明确的情感控制。实验结果表明,Sympatheia在生成语义内容和口语表达均情感适当的响应方面优于语音对话基线。我们进一步表明,相同的VA界面可以整合来自不同感知模块(包括面部表情、生物信号和文本情感描述)的情感估计,从而在语音单独提供有限情感证据时改善响应对齐。这些结果表明,连续情感调节是构建情感自适应语音助手的有效实际步骤。

English abstract

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

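2606.00771 2026-06-02 cs.LG cs.AI cs.SD Updated version

Logit Distillation on Manifolds: Mapping by Learning

流形上的Logit蒸馏:通过学习进行映射

Yiru Yang, Junling Wang, Nishant Kumar Singh, Luohong Wu, Haoran Yan

Affiliations * University of Zurich; ETH Zurich; Deutsche Bank Securities

AI summary Proposes a layer- and point-wise projection mapping that aligns student and teacher representations in a high-dimensional embedding space and, combined with LoRA injection, improves word error rate while substantially reducing the number of trainable parameters.

Details
AI Chinese abstract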

提高几乎任何机器学习模型性能的一种简单方法是,不训练单个模型,而是训练多个使用不同算法的模型,这些模型对相同数据做出略有不同的预测和错误,从而提高平均预测和鲁棒性。然而,使用整个模型集成进行预测是繁琐且计算成本过高的,无法部署给大量用户,特别是当模型是大型神经网络时。为此,我们引入了一种层和点投影映射,在训练过程中将学生和教师表示映射到对齐的高维嵌入空间。所提出的方法结合LoRA注入,将学生模型的可训练参数减少到教师模型的不到1%,同时与其他蒸馏方法相比,显著提高了词错误率(WER),如消融研究所示。与专家混合不同,我们的方法可以快速并行训练。

English abstract

A simple way to improve the performance of almost any machine learning model is to train not a single model but several models with diverse algorithms, which make slightly different kinds of predictions and errors on the same data and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer- and point-wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during training. The proposed approach, combined with LoRA injection, reduces the student model's trainable parameters to less than 1% of the teacher model, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.

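2606.00684 2026-06-02 eess.AS cs.CL cs.SD Updated version

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

连续归一化流用于分布外检测的局部诊断

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen, Giampiero Salvi

Affiliations * Department of Electronic Systems, Norwegian University of Science and Technology, Trondheim, Norway

AI summary For out-of-distribution detection of target observations lying in a subspace of high-dimensional data, proposes a Lagrangian sub-flow framework based on continuous normalizing flows and uses geometric diagnostic signals from the velocity field to design metrics for zero-shot phoneme-level mispronunciation detection that outperform likelihood-based methods.

Comments 16 pages, 5 figures

Details
AI Chinese abstract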

我们解决了嵌入在高维数据空间子空间中的目标观测的分布外(OOD)检测问题。利用连续归一化流(CNFs),我们提出了一个拉格朗日子流(LSF)框架,旨在隔离并估计表示中相关分量的密度,同时将剩余分量作为上下文。通过对语音合成模型的实验,我们表明CNFs与其他深度生成模型(DGMs)类似,容易受到“似然悖论”的影响,即OOD样本被错误地赋予高似然。这归因于DGMs的归纳偏差,即优先考虑低级结构细节而非高级语义一致性。为了缓解这一现象,我们提出了基于子流轨迹上速度场的若干几何诊断信号。基于这些信号,我们为零样本音素级发音错误检测这一具有挑战性的任务设计了指标。最后,我们在一个真实的发音错误检测基准上展示了这些指标相对于基于似然的方法的优越性。

English abstract

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density for the relevant components in the representation and using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs that prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.

2606.00670 2026-06-02 cs.SD cs.AI Updated version

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

超越口部:声学不确定性下视听句子识别中的上半脸情感线索

Zhou Yang, Yueyi Yang

Affiliations * Faculty of Education and Psychology, University of Oulu, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Finland

AI summary Using the CREMA-D corpus and feature-based classifiers, this study examines whether upper-face affective information aids audiovisual sentence recognition under acoustic degradation and finds that upper-face affective cues improve model calibration and robustness.

Details
AI Chinese abstract

面对面言语理解本质上是多模态的,整合了声学信号与可见的发音、面部表情、头部运动及其他社交相关线索。虽然视听言语系统通常将口部区域作为语言信息的主要视觉来源,但情感面部表情常被单独视为情感识别目标。本文研究在声学退化条件下,上半脸情感信息是否有助于视听句子识别,超越音频和口部区域线索。使用CREMA-D视听情感言语语料库,我们在四种线索条件下训练基于特征的句子分类器:仅音频(A)、音频加口部/下半脸特征(A+M)、音频加上半脸特征(A+U)以及音频加口部和上半脸特征(A+M+U)。模型在干净音频和粉红噪声条件下(+10 dB、+5 dB和0 dB SNR)进行评估,采用演员独立划分。结果表明,在退化音频下,口部/下半脸特征提供了显著的鲁棒性优势。在0 dB SNR下,A+M相比A准确率提升0.0794,演员自举95%置信区间为[0.0296, 0.1298]。上半脸情感线索表现出更微妙的效果。尽管A+M+U相比A+M的直接准确率增益很小,但全脸模型在不同SNR水平上持续改善校准,并且在噪声条件下优于打乱的上半脸对照。这些发现表明,情感面部信息可能支持声学不确定性下的多模态鲁棒性和置信度估计,而不直接编码词汇内容。更广泛地说,该研究强调了社交表达性面部线索在以人为中心的视听交互系统中的潜在作用。

English abstract

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.
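
The abstract reports an actor-bootstrap confidence interval for the accuracy gain of A+M over A. A minimal sketch of one way such an interval could be computed is shown below: whole actors are resampled with replacement so that correlated trials from the same actor stay together. The data layout, the percentile method, the number of resamples, and the function name are assumptions for illustration, not details taken from the paper.

```python
import random


def actor_bootstrap_ci(per_actor, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an accuracy gain, resampling whole actors.

    per_actor maps an actor id to a list of (correct_base, correct_new) pairs,
    each entry being 0 or 1 for the same test item under the two models.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    actors = sorted(per_actor)

    def gain(sample):
        # Pool all trials from the sampled actors and average the paired difference.
        pairs = [pair for actor in sample for pair in per_actor[actor]]
        return sum(new - base for base, new in pairs) / len(pairs)

    diffs = sorted(
        gain([rng.choice(actors) for _ in actors]) for _ in range(n_boot)
    )
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return gain(actors), lower, upper
```

Resampling at the actor level, rather than at the trial level, usually gives wider and more honest intervals when each actor contributes many correlated utterances.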

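2606.00629 2026-06-02 cs.SD cs.HC cs.LG eess.AS Updated version

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

质量音频原型:统一声音检索与程序化生成的系统原型

Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams, Israel Mason-Williams, Emmanouil Benetos, Joshua Reiss

Affiliations * GitHub

AI summary Proposes QuAP, a system that unifies content-based audio retrieval with real-time procedural generation and integrates rule-based parameter guidance to reduce the procedural distance in sound design; subjective evaluation and user testing support its effectiveness and practicality.

Comments DAFx 2026

Details
AI Chinese abstract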

声音设计工作流经常在耗时的库搜索和复杂的程序化合成之间摇摆,从业者通常依赖独立的工具分别应对每个挑战。本文介绍了质量音频原型(QuAP),一个工作原型,它在单一界面中统一了基于内容的音频检索和程序化声音生成,减少了叙事概念与其声音实现之间的操作距离。QuAP集成了基于相似性的检索引擎与实时程序化音频模型,并辅以基于规则的助手,提供基于感知的参数指导,给出源自经验优化的定义和建议,而不需要先验的合成知识。初步评估证实了这种方法的可行性:主观评估显示六个嵌入合成模型中有五个在质量上具有统计显著性的提升,编码器消融研究在音效数据集上确立了首选的检索架构。与16名从业者的用户评估证实了该工具的工作流实用性,所有参与者一致认为参数助手在保持创作自主性的同时降低了程序化交互的门槛。

English abstract

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.

2606.00081 2026-06-02 cs.LG cs.AI cs.SD Updated version

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer: 一种融合统计特征的混合多分支Transformer用于DAS模式识别

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

Affiliations * IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems Lille, France; IMT Mines Ales, Institut Mines-Telecom, Ales, France

AI summary To handle the high dimensionality and complex spatio-temporal patterns of DAS data, proposes DAStatFormer, a hybrid multibranch Transformer that extracts 24 ANOVA-selected statistical features and applies Gated Transformer Networks, reaching up to 99.4% accuracy while reducing data size by orders of magnitude.

Details
AI Chinese abstract

分布式声学传感(DAS)通过光纤实现大规模监测,但其高维度和复杂的时空模式使得事件分类具有挑战性。现有的深度学习方法——CNN、循环模型和Transformer变体——要么无法捕获长程依赖,要么需要以高昂成本处理原始DAS矩阵。我们提出DAStatFormer,一种混合多分支Transformer,将紧凑的多域统计特征与门控Transformer网络相结合。我们不是使用原始信号,而是从每个通道的时域、波形和频域提取24个ANOVA选择的属性,将数据量减少数个数量级,同时保留判别信息。每个域通过专用的逐步骤和逐通道注意力分支处理,并通过自适应门控机制融合。在开放的$\Phi$-OTDR基准测试和真实场景DAS数据集上的实验表明,DAStatFormer实现了高达99.4%的准确率和接近完美的实际性能,同时使用的参数和推理成本显著低于DASFormer和DeepViT等模型。这些结果证明了其适用于可扩展、实时的DAS监测。我们在https://github.com/MichelD-git/DAStatFormer发布代码。

English abstract

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches (CNNs, recurrent models, and Transformer variants) either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open Φ-OTDR benchmark and a real-scenario DAS dataset show that DAStatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer

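2606.00066 2026-06-02 cs.SD eess.AS Updated version

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

DUET: 扩散与流匹配驱动的文本转语音的统一双空间情感控制

Xu Zhang, Longbing Cao, Zhangkai Wu

Affiliations * Frontier AI Research Centre, Macquarie University

AI summary Proposes DUET, a framework that achieves fine-grained emotion control in pretrained diffusion and flow-matching TTS models through dual-space control, combining hidden-space steering with mel-space gradient guidance, and outperforms 10 supervised baselines.

Details
AI Chinese abstract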

基于扩散和流匹配的文本转语音(TTS)模型在自然度方面表现出色,但由于情感信号与说话人身份纠缠,往往缺乏显式的情感控制。我们发现情感嵌入作为冻结隐藏状态的线性可解码方向出现,几乎与编码说话人身份的方向正交。这启发了一个即插即用框架DUET,用于对预训练的扩散和流匹配TTS模型进行情感控制。在生成过程中,DUET统一双空间控制,在单步更新中实现细粒度情感干预:隐空间引导沿目标情感方向移动生成,而梅尔谱引导通过从可微分声码器反向传播的梯度细化频谱细节。我们在三个数据集上的五个架构多样的预训练TTS骨干上验证了DUET,它跨范式超越了10个有监督的最先进情感TTS基线,并获得了最高的人类评分情感适宜性。为了进一步展示其定性行为,我们将DUET部署在Ameca人形机器人上,使其产生丰富表现力的情感语音,展示了即插即用情感交互在具身智能体中的巨大潜力。

English abstract

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.

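2605.30748 2026-06-02 cs.SD cs.AI eess.AS Updated version

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Chatterbox-Flash: 用于流式零样本TTS的先验校准块扩散

Deokjin Seo, Gangin Park, Kihyun Nam

Affiliations * Resemble AI; Seoul National University; KAIST

AI summary Proposes Chatterbox-Flash, which fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder for parallel generation within blocks and streaming inference across blocks, and introduces prior-calibrated scoring and an early-decoding schedule to counter the quality degradation caused by the long-tail token distribution.

Comments 8 pages, 4 figures, 9 tables

Details
AI Chinese abstract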

我们提出Chatterbox-Flash,一种零样本文本转语音模型,通过将预训练的自回归TTS解码器微调为块扩散解码器获得,支持每个块内的并行令牌生成,同时保持逐块流式传输。我们发现,将主流的块扩散解码直接迁移到离散语音令牌会降低质量,因为长尾令牌分布使并行位置选择偏向少数高频令牌。为在不修改架构的情况下缓解这一问题,我们引入了两种推理时技术:先验校准评分(减去块级边际令牌分布)和早期解码调度(基于校准置信度自适应终止迭代)。在标准零样本TTS基准测试中,Chatterbox-Flash实现了与强自回归和非自回归基线相当的高保真合成,同时支持流式推理,首包时间与流式AR系统相当,且实时因子显著降低。代码和音频样本可在 https://github.com/resemble-ai/chatterbox-flash 获取。

English abstract

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.
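
The prior-calibrated scoring mentioned above can be sketched as follows. The abstract only says that the block-level marginal token distribution is subtracted; this example assumes the subtraction happens in the log domain and that the marginal is the mean of the per-position distributions in the block. Function names and the toy probabilities are hypothetical.

```python
import math


def block_marginal(probs):
    """Average the per-position token distributions across one block."""
    n_pos, vocab = len(probs), len(probs[0])
    return [sum(p[v] for p in probs) / n_pos for v in range(vocab)]


def candidates(probs):
    """All (position, token) pairs in the block."""
    return [(i, v) for i, p in enumerate(probs) for v in range(len(p))]


def pick_raw(probs):
    """Most confident (position, token) under the raw probabilities."""
    return max(candidates(probs), key=lambda iv: probs[iv[0]][iv[1]])


def pick_calibrated(probs, eps=1e-9):
    """Most confident (position, token) after subtracting the log marginal."""
    marginal = block_marginal(probs)

    def score(iv):
        i, v = iv
        return math.log(probs[i][v] + eps) - math.log(marginal[v] + eps)

    return max(candidates(probs), key=score)


# Toy block: token 0 is frequent everywhere, token 1 is specific to position 2.
block = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.30, 0.65, 0.05],
]
print(pick_raw(block), pick_calibrated(block))
```

In the toy block, raw confidence commits first to the frequent token at position 0, while the calibrated score prefers position 2, where the prediction is distinctive relative to the block as a whole. This mirrors the bias toward high-frequency tokens that the abstract describes.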

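2605.29948 2026-06-02 cs.SD cs.AI eess.AS Updated version

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

HoliTok: 一种具有鲁棒的双重语音生成与理解能力的连续整体式分词

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

Affiliations * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; hi lab, Xiaohongshu Inc, China

AI summary Proposes HoliTok, a continuous holistic speech tokenization model trained with a progressive strategy that jointly preserves signal fidelity, incorporates semantic information, and maintains latent learnability; a unified AR+DiT model built on it performs speech synthesis and recognition, and experiments show that it operates robustly in the unified generation-understanding architecture without additional optimization tricks.

Comments 14 pages, 2 figures, 8 tables

Details
AI Chinese abstract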

统一的语音基础模型需要一个整体式的分词空间,该空间既要能被语言模型学习,又要能解码为高质量波形。然而,现有的语音分词器往往无法同时满足这些要求,导致架构复杂度和训练设计增加。我们提出HoliTok,一种用于统一生成-理解建模的连续整体式语音分词模型。HoliTok将48 kHz语音编码为紧凑的25 Hz序列,包含128维潜在向量。它采用渐进策略进行训练,联合保留信号级保真度、融入语义信息并保持强大的潜在可学习性。基于此分词,我们构建了一个统一的AR+DiT模型用于语音合成和识别,其中相同的潜在序列既支持生成特定任务,也支持统一的生成-理解任务。实验表明,HoliTok实现了有竞争力的重建保真度,提高了高质量和可控合成中的生成可学习性,并且在评估的表示中,它是唯一一个在我们的统一生成-理解架构中无需额外优化技巧即可鲁棒运行的表示。这些结果表明,HoliTok作为一种有效的语音分词器,为统一口语建模提供了基础的表示接口。代码可在 https://github.com/bovod-sjtu/HoliTok 获取。

English abstract

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.

2604.19532 2026-06-02 cs.SD cs.AI Updated version

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

BEAT: 通过均匀时间步对符号音乐进行分词和生成

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

Affiliations * South China University of Technology; National University of Singapore; Mohamed bin Zayed University of Artificial Intelligence; New York University

AI summary Proposes a tokenization that uses a uniform beat as the basic unit, encoding all events at the same pitch within one time step as a single token, and shows on music continuation and accompaniment generation that it improves musical quality and structural coherence over conventional event-based methods.

Details
AI Chinese abstract

将音乐分词以适应语言模型的通用框架是一个具有挑战性的问题,特别是考虑到音乐可以表示的各种符号结构(例如,序列、网格和图)。迄今为止,大多数方法将符号音乐分词为音乐事件序列,如起始、音高、时移或复合音符事件。这种策略直观且已在基于Transformer的模型中证明有效,但它隐式处理了音乐时间的规律性:单个令牌可能跨越不同时长,导致时间进展不均匀。在本文中,我们考虑另一种分词方式是否可能,其中均匀长度的音乐步长(例如,一个节拍)作为基本单元。具体来说,我们将单个时间步内相同音高的所有事件编码为一个令牌,并显式按时间步对令牌进行分组,这类似于钢琴卷帘表示的稀疏编码。我们在音乐续写和伴奏生成任务上评估了所提出的分词方法,并将其与主流事件基方法进行比较。结果表明,所提出的分词方法提高了音乐质量和结构连贯性,而额外分析证实了更高的效率和更有效地捕获长程模式。

English abstract

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
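
To make the beat-level idea concrete, the sketch below groups notes by beat and emits one token per pitch per beat, where the token records what that pitch does at each subdivision of the beat. The token layout (a pitch paired with a pattern string of `o` for onset, `s` for sustain, and `-` for silence), the four subdivisions per beat, and the function name are assumptions for illustration; the paper's actual vocabulary is not given in the abstract.

```python
def beat_tokens(notes, steps_per_beat=4):
    """Group notes into per-beat tokens, one token per active pitch.

    notes: iterable of (pitch, onset_step, duration_steps) on a uniform grid.
    Returns a list of beats; each beat is a sorted list of (pitch, pattern)
    tokens, with pattern marking onset 'o', sustain 's', or silence '-'.
    """
    beats = {}
    for pitch, onset, duration in notes:
        for step in range(onset, onset + duration):
            beat, pos = divmod(step, steps_per_beat)
            pattern = beats.setdefault(beat, {}).setdefault(
                pitch, ["-"] * steps_per_beat
            )
            pattern[pos] = "o" if step == onset else "s"
    n_beats = max(beats) + 1 if beats else 0
    return [
        sorted((pitch, "".join(pat)) for pitch, pat in beats.get(b, {}).items())
        for b in range(n_beats)
    ]


# Two short C4 notes and one long E4 note that crosses the beat boundary.
example = [(60, 0, 2), (60, 2, 2), (64, 0, 6)]
print(beat_tokens(example))
```

Every beat yields its own token group, so position in the sequence maps directly to musical time. This is the uniform time progression that event-based tokenizations only represent implicitly.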

2605.07061 2026-06-02 cs.SD cs.AI cs.CV cs.MM Updated version

Do Joint Audio-Video Generation Models Understand Physics?

联合音视频生成模型是否理解物理?

Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian

Affiliations * University of Texas at Dallas; University of Washington; University of California, Los Angeles

AI summary Proposes AV-Phys Bench, a benchmark that tests the physical commonsense of joint audio-video generation models, and finds that all models fall short on physical consistency, especially in event-driven and environment-driven transition scenes.

Comments Preprint. Project Page: https://zijuncui.com/AV-Phys/. Full abstract appears in the PDF

Details
AI Chinese abstract

联合音视频生成模型正迅速接近专业制作质量,这引发了一个核心问题:它们是否理解音视频物理,还是仅仅生成看似合理但违反现实一致性的声音和帧?我们引入了AV-Phys Bench,一个用于评估联合音视频生成中物理常识的基准。AV-Phys Bench测试模型在三种场景类别上的表现:稳态、事件转换和环境转换。它涵盖了从现实场景中提取的基于物理的子类别,以及故意要求物理不一致音视频行为的反AV物理提示。每个生成结果沿五个维度评估:视觉语义遵循、音频语义遵循、视觉物理常识、音频物理常识和跨模态物理常识。在三个专有模型和四个开源模型中,我们发现Seedance 2.0整体表现最佳,但所有模型距离鲁棒的物理理解仍有很大差距。在事件驱动和环境驱动转换上性能急剧下降,即使是强大的专有系统在反AV物理提示上也崩溃。我们进一步引入了AV-Phys Agent,一个结合多模态语言模型与确定性声学测量工具的ReAct风格评估器,产生的排名与人类评分高度一致。我们的结果指出,跨模态物理一致性和转换驱动的场景动态是联合音视频生成的关键开放挑战。

English abstract

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.

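2604.18360 2026-06-02 cs.SD cs.CL Updated version

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

Omni-Embed-Audio: 利用多模态大语言模型实现鲁棒的音频-文本检索

HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang

Affiliations * Sogang University

AI summary Proposes Omni-Embed-Audio (OEA), a retrieval encoder that leverages multimodal LLMs with native audio understanding; evaluated with User-Intent Queries (UIQs) and hard negative mining, it matches M2D-CLAP on text-to-audio retrieval while clearly outperforming existing methods on text-to-text retrieval and hard negative discrimination.

Comments Accepted at ACL 2026 Main Conference. Camera-ready version

Details
AI Chinese abstract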

基于对比语言-音频预训练(CLAP)的音频-文本检索系统在传统基准上表现强劲;然而,这些基准依赖于与真实世界搜索行为差异显著的标题风格查询,限制了其对实际检索鲁棒性的评估。我们提出了Omni-Embed-Audio(OEA),一种利用具有原生音频理解能力的多模态大语言模型的检索导向编码器。为了系统评估超越标题风格查询的鲁棒性,我们引入了用户意图查询(UIQ)——五种反映自然搜索行为的表述形式:问题、命令、关键词标签、释义和基于排除的负查询。对于负查询,我们开发了一个硬负样本挖掘管道,并提出了判别指标(HNSR, TFR),评估模型抑制声学相似干扰物的能力。在AudioCaps、Clotho和MECAT上的实验表明,OEA在文本到音频检索性能上与最先进的M2D-CLAP相当,同时在两个关键领域展现出明显优势:(1)主导的文本到文本检索(相对提升22%),以及(2)显著优越的硬负样本判别(HNSR@10提升4.3个百分点,TFR@10相对提升34.7%),揭示了大语言模型骨干对复杂查询具有更优的语义理解能力。

English abstract

Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs) - five formulations reflecting natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3%p HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.

2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC Updated version

Acoustic and perceptual differences between standard and accented speech and their voice clones

标准口音与带口音语音及其语音克隆的声学与感知差异

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

Affiliations * Department of Linguistics, University at Buffalo, United States; Department of Computer Science and Engineering, University at Buffalo, United States; Emeritus Faculty, Australian National University, Australia

AI summary Through computational and perceptual experiments comparing standard and accented Mandarin speech with their voice clones, finds that accent affects perceived identity match and intelligibility: clones of standard-accent speakers are rated closer to their originals, and accented speech gains more in intelligibility when cloned.

Details
AI Chinese abstract

语音克隆通常根据整体质量进行评估,但关于口音保留及其感知后果的了解较少。我们采用计算和感知相结合的设计,比较标准口音和重度口音普通话及其语音克隆。基于嵌入的分析显示,在多个说话人判别嵌入空间中,带口音说话人的原始-克隆距离更大,但在根据每个说话人的原始内部基线变异性进行归一化后,这种差异消失。在感知研究中,标准口音说话人的克隆被评价为比带口音说话人的克隆更接近其原始声音,并且从原始到克隆的可懂度增加,其中带口音语音的增益更大。这些结果表明,即使口音变异未反映在基线归一化的说话人嵌入距离中,它也能影响语音克隆中的感知身份匹配和可懂度,并促使将口音保留视为说话人身份保留的一个明确组成部分,而不是假设它完全由现成的说话人判别嵌入所捕获。

English abstract

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.

2510.00180 2026-06-02 eess.AS cs.SD eess.SP version update

DiffAU: Diffusion-Based Ambisonics Upscaling

Amit Milstein, Nir Shlezinger, Boaz Rafaely

Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes DiffAU, which combines diffusion models with adaptations for spatial audio to generate third-order Ambisonics from first-order Ambisonics, aiming at fast and reliable upscaling.

Abstract

Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models combined with a novel adaptation to spatial audio to generate third-order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.
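For a sense of scale, a full-sphere Ambisonics signal of order N carries (N + 1)^2 spherical-harmonic channels. This is a standard property of the format, not a figure taken from the abstract. The short sketch below works out what that means for upscaling from first to third order; the function name is hypothetical.

```python
def ambisonics_channels(order: int) -> int:
    """Channel count of a full-sphere Ambisonics signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


foa_channels = ambisonics_channels(1)            # first order: 4 channels
third_order_channels = ambisonics_channels(3)    # third order: 16 channels
channels_to_generate = third_order_channels - foa_channels  # 12 new channels
```

Upscaling from FOA to third order therefore means inferring 12 channels that are not present in the 4-channel input.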

2602.02557 2026-06-02 cs.LG cs.AI cs.SD version update

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

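Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

Affiliations: University of Oxford

AI summary: Introduces and validates the "Alignment Curse": stronger text-audio modality alignment makes text attacks transfer more readily to audio. Black-box experiments show that text-transferred audio attacks perform comparably to, and often better than, native audio attacks, exposing a fundamental tension between capability and safety.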

Comments 23 pages, 5 figures

Abstract

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.

2601.06199 2026-06-02 eess.AS cs.AI cs.SD version update

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

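Junseok Lee, Sangyong Lee, Chang-Jae Chun

Affiliations: OKESTRO; Sejong University

AI summary: Targets the token explosion of long speech inputs with FastSLM, whose Hierarchical Temporal Abstractor (HTA) compresses audio to 1.67 tokens per second (a 97% reduction). The model uses far fewer FLOPs and parameters while remaining competitive with state-of-the-art models on long-form speech benchmarks.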

Comments Title updated

Abstract

Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
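The two figures in the abstract (1.67 tokens per second and a 97% reduction) can be combined into a rough worked calculation. The abstract does not state the uncompressed token rate, so the baseline below is only what the two numbers imply if the reduction is measured in tokens per second. Both figures are presumably rounded, and the function names are hypothetical.

```python
COMPRESSED_RATE = 1.67  # tokens per second, as quoted in the abstract
REDUCTION = 0.97        # fractional reduction, as quoted in the abstract


def implied_baseline_rate(compressed_rate: float, reduction: float) -> float:
    """Uncompressed token rate implied by a compressed rate and a reduction."""
    return compressed_rate / (1.0 - reduction)


def tokens_for(duration_s: float, rate: float) -> float:
    """Token count for audio of the given duration at the given rate."""
    return duration_s * rate


baseline_rate = implied_baseline_rate(COMPRESSED_RATE, REDUCTION)  # about 55.7 tokens/s
hour_compressed = tokens_for(3600, COMPRESSED_RATE)                # about 6,000 tokens
hour_baseline = tokens_for(3600, baseline_rate)                    # about 200,000 tokens
```

On these assumptions, one hour of speech drops from roughly 200,000 input tokens to roughly 6,000.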

2601.19919 2026-06-02 cs.CL cs.AI cs.SD version update

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

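Junseok Lee, Nahun Kim, Sangyong Lee, Chang-Jae Chun

Affiliations: OKESTRO Co., Ltd; Sejong University

AI summary: Proposes Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum that gradually reduces reliance on the teacher model and then adds a self-knowledge distillation phase. Applied to compressing Whisper, it yields a 5x inference speedup and a 1.07% lower word error rate.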

Comments Title and content have been updated

Abstract

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In the context of Automatic Speech Recognition (ASR), previous studies have predominantly focused on forcing the student model to strictly mimic the predictive distribution of a massive teacher model. However, this static dependency often presents an inherent trade-off: while the student rapidly acquires basic linguistic representations, it simultaneously inherits the teacher's domain-specific blind spots and over-confident hallucinations, leading to a severe decline in out-of-distribution generalization capacity. To effectively mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the dependency on the teacher's distribution as training progresses, thereby unlocking the student's independent reasoning capacity, and subsequently employs a self-knowledge distillation phase to act as a structural regularizer. By applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In our comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model by yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state-of-the-art for generalizable model compression.
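The idea of decaying the dependency on the teacher can be sketched as a time-varying weight on the distillation term. The abstract gives neither the schedule nor the loss, so the linear decay, the KL-plus-cross-entropy mix, and every name below are assumptions for illustration only. The later self-knowledge distillation phase is not modelled.

```python
import math


def teacher_weight(step, total_steps, start=1.0, end=0.0):
    """Assumed schedule: linear decay of the teacher weight over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[target_index])


def distillation_loss(student_logits, teacher_logits, target_index, step, total_steps):
    """Blend teacher imitation and ground-truth supervision by the decayed weight."""
    w = teacher_weight(step, total_steps)
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    imitation = kl_divergence(teacher, student)
    supervised = cross_entropy(student, target_index)
    return w * imitation + (1.0 - w) * supervised
```

Early in training the loss is dominated by imitation of the teacher. By the final step, under this assumed schedule, the student is trained on the labels alone.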

2601.03615 2026-06-02 cs.CL cs.SD eess.AS version update

SARA: Stress Test Reasoning in Audio Deepfake Detection

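Binh Nguyen, Charles Fleming, Thai Le

Affiliations: Indiana University; Cisco Research

AI summary: Proposes SARA, a framework that evaluates the reasoning reliability of audio language models under adversarial attack along three dimensions: acoustic perception, reasoning-verdict coherence, and dissonance. Acoustic attacks reduce coherence, whereas linguistic attacks preserve coherence but succeed more often. The textual coherence of reasoning traces can serve as a latent indicator for detecting adversarial inputs.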

Comments Preprint for ACL 2026 submission

Abstract

Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing transparency to their predictions via reasoning traces. However, such reasoning may not support the model predictions, reflecting poor coherence, or, worse, may rationalize incorrect predictions with plausible but misleading explanations. Moreover, the behavior of ALM reasoning under adversarial attacks remains under-explored, raising questions about the practical reliability of such explanation capabilities. To address this gap, this study introduces SARA (Shift Analysis of Reasoning in Audio), a diagnostic framework that evaluates ALM reasoning across three dimensions: acoustic perception, reasoning-verdict coherence, and dissonance. We test five open-source ALMs against both acoustic and linguistic adversarial attacks. We show that acoustic attacks significantly degrade reasoning-verdict coherence (average decrease of 14.20%), frequently inducing internal logical conflicts. Conversely, linguistic attacks achieve higher attack success rates while maintaining reasoning coherence. We further demonstrate that the textual coherence of generated reasoning traces also serves as a latent indicator of adversarial inputs, enabling effective detection of perturbed audio (F1 of 0.78) without accessing the raw acoustic signal. These findings suggest that reasoning traces provide diagnostic utility that persists even when final classification outputs are compromised.

2512.10120 2026-06-02 cs.SD cs.AI version update

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

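Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Proposes VocSim, a training-free, label-free benchmark that evaluates general-purpose audio representations on zero-shot content identity through the geometric alignment of frozen embeddings. It reports strong results across audio domains and exposes a cross-lingual generalization gap.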

Comments Accepted at ICML 2026. Code: https://github.com/vocsim/benchmark

Abstract

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings, with no parameters updated and no labels used (a label-free PCA whitening is fit per subset to correct anisotropy). VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds, isolating content representation from source separation (polyphonic mixtures are out of scope). We evaluate embeddings with Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation, calibrated by lift over an empirical permutation baseline. A simple pipeline of frozen Whisper features, time-frequency pooling, and label-free PCA yields strong zero-shot performance with stable GSR rankings across domains (Kendall's tau = 0.60). However, on blind low-resource speech (Shipibo-Conibo, Chintang), local retrieval collapses while remaining above chance, exposing a cross-lingual speech generalization gap. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art on the HEAR benchmark. We release data, code, and a public leaderboard.
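Precision@k as a measure of local purity can be sketched as follows: for each clip, take its k nearest neighbours in embedding space and count how many share its class. The abstract does not specify the distance metric or tie handling, so the use of cosine distance here is an assumption. The PCA whitening step and the calibration against a permutation baseline are omitted, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def precision_at_k(embeddings, labels, k):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    n = len(embeddings)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    scores = []
    for i in range(n):
        # Rank all other points by distance to point i (ties keep index order).
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_distance(embeddings[i], embeddings[j]),
        )[:k]
        hits = sum(1 for j in neighbours if labels[j] == labels[i])
        scores.append(hits / k)
    return sum(scores) / n
```

Labels are used only for scoring, never for fitting, which is consistent with the training-free setting the abstract describes.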

2511.13487 2026-06-02 eess.AS cs.LG cs.SD version update

Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

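Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines

Affiliations: Taighde Éireann – Research Ireland

AI summary: Systematically evaluates how combinations of time-frequency features affect binaural sound source localization. Carefully chosen feature combinations (for example, channel spectrograms with ILD and IPD) can outperform increases in model complexity, giving practical guidance for domain-specific and general-purpose localization.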

Comments Accepted at EUSIPCO 2026

Abstract

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
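The two interaural features named in the abstract can be illustrated per time-frequency bin. The abstract does not say how the authors compute them, so the sketch below uses the common textbook definitions. The left-relative-to-right sign convention, the small `eps` guard, and the function names are assumptions.

```python
import cmath
import math


def ild_db(left: complex, right: complex, eps: float = 1e-12) -> float:
    """Interaural level difference in dB for one STFT bin (left relative to right)."""
    return 20.0 * math.log10((abs(left) + eps) / (abs(right) + eps))


def ipd_rad(left: complex, right: complex) -> float:
    """Interaural phase difference in radians, wrapped to (-pi, pi]."""
    return cmath.phase(left * right.conjugate())


def interaural_features(left_bins, right_bins):
    """ILD and IPD for matching sequences of left and right STFT bins."""
    ild = [ild_db(l, r) for l, r in zip(left_bins, right_bins)]
    ipd = [ipd_rad(l, r) for l, r in zip(left_bins, right_bins)]
    return ild, ipd
```

ILD depends only on the magnitudes and IPD only on the phases, which matches the abstract's grouping of features into amplitude-based and phase-based.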

2510.01891 2026-06-02 cs.SD cs.AI eess.AS version update

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

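Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

Affiliations: SONICOM

AI summary: Addresses the difficulty of measuring individual HRTFs with a transformer-based HRTF upsampling architecture that uses attention, operates in the spherical harmonic domain, and adds a neighbour dissimilarity loss to achieve high-fidelity HRTF reconstruction.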

Comments Accepted to IEEE Transactions on Multimedia 2026

Abstract

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.

2505.18614 2026-06-02 cs.CL cs.LG cs.MM cs.SD eess.AS version update

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

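Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

Affiliations: Yonsei University; Seoul National University

AI summary: Introduces MAVL, described as the first multilingual, multimodal benchmark for lyrics translation, and SylAVL-CoT, a syllable-constrained audio-video LLM that uses audio-video cues and syllabic constraints to improve singability and translation accuracy.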

Comments Accepted to EMNLP 2025, Project Page: https://k1064190.github.io/papers/paper1.html, our codes and datasets are available at https://github.com/k1064190/MAVL

Abstract

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.

2412.03771 2026-06-02 cs.SD cs.LG eess.AS version update

Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

Ysobel Sims, Alexandre Mendes, Stephan Chalup

Affiliations: School of Information and Physical Sciences, University of Newcastle, Australia

AI summary: Proposes a diffusion-based conditional generative method for zero-shot environmental sound classification that outperforms existing baselines on average across several audio datasets.

Abstract

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduce a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across the six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.