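arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.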

2606.02448 2026-06-02 eess.SP cs.SD Updated version

Advancing Electrolaryngeal Speech Enhancement through Speech-Text Representation Learning

Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, Tomoki Toda

Affiliations: Graduate School of Informatics, Nagoya University; School of Biological Science and Medical Engineering, Beihang University; TARVO, Inc.; Information Technology Center, Nagoya University

AI summary: Proposes a learning framework that fuses speech and text representations to improve the mapping from electrolaryngeal to normal speech, and the reconstruction quality, within a sequence-to-sequence voice conversion model; experiments show it outperforms methods that rely on speech representations alone.

Comments 15 pages, 7 figures. Accepted to IEEE TBME

DOI: 10.1109/TBME.2026.3694703
Journal ref: IEEE Transactions on Biomedical Engineering, Early Access, 2026

Abstract

Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.


2606.01703 2026-06-02 cs.SD cs.AI cs.CV Updated version

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

Affiliations: Jen Music AI

AI summary: Proposes JenBridge, a framework that combines a Transformer-based generative model, dual text-visual conditioning for alignment, and an adaptive transition mechanism driven by an LLM agent to generate high-fidelity long-form video soundtracks that stay natural and coherent across scene transitions.

Abstract

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.


2606.01686 2026-06-02 cs.SD cs.AI Updated version

A 1000-Hour EEG-EMG-Audio Dataset of Japanese Speech Production

Motoshige Sato, Ilya Horiguchi, Masakazu Inoue, Kenichi Tomeoka, Eri Hatakeyama, Yuya Kita, Atsushi Yamamoto, Ippei Fujisawa, Shuntaro Sasai

Affiliations: National Institute of Information and Communications Technology, Japan

AI summary: Presents a multimodal dataset of 1020 hours of synchronized scalp EEG, facial EMG, and speech audio recorded from three healthy native Japanese speakers during open-vocabulary overt speech, intended to support research on speech decoding, multimodal signal processing, and EEG representation learning.

Abstract

We present a multimodal dataset of 1020 hours of simultaneously recorded scalp electroencephalography (EEG), facial electromyography (EMG), and speech audio from three healthy native Japanese speakers during open-vocabulary overt speech. Recordings were acquired with three EEG systems (an ultra-high-density system, g.Pangolin, and two cap-type systems, g.SCARABEO and eegosports), spanning 62 to 128 channels, across many sessions over several months. Each session provides time-synchronized EEG, facial EMG, and audio, together with speech-event annotations and transcriptions. Although collected with speech decoding as a primary motivation, the dataset also supports work on multimodal signal processing, artifact modeling, longitudinal and cross-device adaptation, and EEG representation learning. Technical validation included power spectral density and event-related potential analyses across participants, devices, and tasks, which showed the expected 1/f spectral profile, task-related alpha-band attenuation, and time-locked evoked responses. The dataset is released in Brain Imaging Data Structure (BIDS) format via OpenNeuro under a CC0 waiver to support both speech-related and broader EEG research.


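2606.01135 2026-06-02 cs.NE cs.SD Updated version

Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition

Tauseef Ahmed, Tao Sun, Jeronimo Castrillon, Kanishkan Vadivel, Guangzhi Tang

Affiliations: University of California, Berkeley; Tsinghua University

AI summary: Proposes spiking and event-driven neuromorphic Mamba models that improve speech recognition efficiency through activation sparsity, achieving more than 60% activation sparsity on LibriSpeech with less than 1% accuracy loss, and develops a cycle-accurate event-driven simulator for algorithm-hardware co-optimization.

Comments Accepted at IJCNN2026

Abstract (translated from the AI-generated Chinese version)

Deep learning has greatly advanced automatic speech recognition (ASR), enabling wide deployment on edge devices such as smartphones and smart home systems. However, the computational and energy demands of deep neural networks pose major challenges for such resource-constrained deployments, causing latency and limiting real-time interaction. Neuromorphic computing offers a promising solution: spiking neural networks (SNNs) and event-driven neural networks introduce activation sparsity, turning dense operations into sparse computation. For ASR, however, studies that evaluate the hardware benefits of different neuromorphic strategies are still lacking. This paper explores spiking and event-driven neuromorphic neural networks to improve activation sparsity in the state-of-the-art SpeechMamba model for ASR. We introduce an event-driven SpeechMamba with FATReLU activations that achieves more than 60% activation sparsity on LibriSpeech with an accuracy drop of less than 1%. We also propose a spiking SpeechMamba whose sparsity exceeds 70% while using 30% fewer parameters than comparable SNNs. Finally, we develop a cycle-accurate event-driven simulator that enables flexible algorithm-hardware co-exploration, helps us identify computational bottlenecks, and yields more than 10% additional efficiency gains.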
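Activation sparsity, the quantity this abstract reports, can be illustrated with a small sketch. It assumes FATReLU behaves as a ReLU with a positive threshold below which activations are set to zero; the threshold value 0.5, the Gaussian pre-activations, and the function names are hypothetical and are not taken from the paper.

```python
import random


def fat_relu(x, threshold):
    """Thresholded ReLU (assumed form): keep x only when it reaches `threshold`."""
    return x if x >= threshold else 0.0


def activation_sparsity(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


rng = random.Random(0)
# Hypothetical pre-activations drawn from a standard normal distribution.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

relu_out = [fat_relu(x, 0.0) for x in pre_activations]  # threshold 0 is a plain ReLU
fat_out = [fat_relu(x, 0.5) for x in pre_activations]   # illustrative threshold

print("ReLU sparsity:", activation_sparsity(relu_out))
print("Thresholded sparsity:", activation_sparsity(fat_out))
```

Raising the threshold zeroes more activations, so an event-driven implementation would have fewer nonzero values to process; the accuracy cost of a given threshold has to be measured on the actual model.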

Local Diagnostics with Continuous Normalizing Flows for Out-of-Distribution Detection

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen, Giampiero Salvi

Affiliations: Department of Electronic Systems, Norwegian University of Science and Technology, Trondheim, Norway

AI summary: Addresses out-of-distribution detection for target observations in a subspace of high-dimensional data; proposes a Lagrangian sub-flow framework based on continuous normalizing flows and uses geometric diagnostic signals from the velocity field to design metrics for zero-shot phoneme-level mispronunciation detection, which outperform likelihood-based methods.

Comments 16 pages, 5 figures

Abstract

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high-dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density of the relevant components in the representation while using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, like other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritizes low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.


2606.00670 2026-06-02 cs.SD cs.AI Updated version

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Zhou Yang, Yueyi Yang

Affiliations: Faculty of Education and Psychology, University of Oulu, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Finland

AI summary: Using the CREMA-D corpus and feature-based classifiers, investigates whether upper-face affective information contributes to audiovisual sentence recognition under acoustic degradation, and finds that upper-face affective cues improve model calibration and robustness.

Abstract

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.
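The noise conditions at +10 dB, +5 dB, and 0 dB SNR can be reproduced in principle by scaling a noise signal before adding it to the clean audio. The sketch below is a minimal illustration under stated assumptions: it uses white Gaussian noise and a sine tone as stand-ins (the study uses pink noise and CREMA-D speech), and all function names are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    target_ratio = 10 ** (snr_db / 10)
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


rng = random.Random(0)
# Toy stand-ins: a 220 Hz tone for speech and white noise for the pink noise.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr_db in (10, 5, 0):
    noisy = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, noisy), 3))
```

At 0 dB the scaled noise carries the same average power as the speech, which is why that condition is the hardest of the three.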


2606.00629 2026-06-02 cs.SD cs.HC cs.LG eess.AS Updated version

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams, Israel Mason-Williams, Emmanouil Benetos, Joshua Reiss

Affiliations: GitHub

AI summary: Proposes QuAP, a system that unifies content-based audio retrieval and real-time procedural generation with rule-based parameter guidance, reducing the distance between a narrative concept and its sonic realisation in sound design; subjective evaluation and user testing support its effectiveness and practicality.

Comments DaFx 2026

Abstract

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.


2606.00081 2026-06-02 cs.LG cs.AI cs.SD Updated version

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

Affiliations: IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems, Lille, France; IMT Mines Ales, Institut Mines-Telecom, Ales, France

AI summary: To address the high dimensionality and complex spatio-temporal patterns of DAS data, proposes DAStatFormer, a hybrid multibranch Transformer that extracts 24 ANOVA-selected statistical features and uses Gated Transformer Networks, reducing data size by orders of magnitude while reaching up to 99.4% accuracy.

Abstract

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches (CNNs, recurrent models, and Transformer variants) either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open Φ-OTDR benchmark and a real-scenario DAS dataset show that DAStatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer
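The idea of replacing raw per-channel signals with a handful of statistics can be sketched as follows. This is only an illustration: the five features below are common descriptors chosen for the example, not the paper's 24 ANOVA-selected attributes, and the function name and toy data are hypothetical.

```python
import math
import statistics


def channel_features(samples):
    """Summarize one channel's time series with a few illustrative statistics."""
    rms = math.sqrt(statistics.fmean([v * v for v in samples]))
    peak = max(abs(v) for v in samples)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "rms": rms,
        "crest_factor": peak / rms if rms > 0 else 0.0,
        "zero_crossing_rate": crossings / (len(samples) - 1),
    }


# Toy DAS window: 2 channels x 1000 time samples.
das_window = [
    [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)],
    [0.1 * math.sin(2 * math.pi * 50 * t / 1000) for t in range(1000)],
]
feature_matrix = [channel_features(channel) for channel in das_window]

# Each channel shrinks from 1000 samples to 5 numbers in this toy setting.
reduction = len(das_window[0]) / len(feature_matrix[0])
print(feature_matrix[0])
print("reduction factor:", reduction)
```

In the described model such per-channel feature vectors, grouped by domain, would feed the attention branches in place of the raw matrix; which statistics to keep is what the ANOVA selection decides.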


2606.00066 2026-06-02 cs.SD eess.AS Updated version

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

Affiliations: South China University of Technology; National University of Singapore; Mohamed bin Zayed University of Artificial Intelligence; New York University

AI summary: Proposes a tokenization that uses a uniform beat as the basic unit, encoding all events at the same pitch within one time step as a single token, and shows on music continuation and accompaniment generation that it improves musical quality and structural coherence over conventional event-based methods.

Abstract

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
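A rough sketch of beat-wise tokenization, under assumptions of our own, may make the description concrete. Here each beat is divided into four slots, and a token is a pitch paired with its slot pattern within that beat (0 = silent, 1 = onset, 2 = sustain). The slot resolution, the state codes, and the function name are hypothetical; the paper's actual vocabulary may differ.

```python
from collections import defaultdict

SLOTS_PER_BEAT = 4  # assumed resolution: sixteenth notes in a quarter-note beat


def tokenize_by_beat(notes, num_beats):
    """Group notes into one token per (beat, pitch).

    notes: iterable of (onset_slot, duration_slots, pitch) with integer slots.
    Returns a list with one entry per beat; each entry lists (pitch, pattern)
    tokens sorted by pitch, so time advances uniformly by one beat per entry.
    """
    grid = defaultdict(lambda: [0] * SLOTS_PER_BEAT)
    for onset, duration, pitch in notes:
        for slot in range(onset, onset + duration):
            beat, offset = divmod(slot, SLOTS_PER_BEAT)
            if beat >= num_beats:
                break
            grid[(beat, pitch)][offset] = 1 if slot == onset else 2
    steps = [[] for _ in range(num_beats)]
    for (beat, pitch), states in sorted(grid.items()):
        steps[beat].append((pitch, tuple(states)))
    return steps


# Four note events: a held C4, two short E4 notes, and a late G4.
notes = [(0, 4, 60), (0, 2, 64), (2, 2, 64), (6, 2, 67)]
steps = tokenize_by_beat(notes, num_beats=2)
for index, tokens in enumerate(steps):
    print("beat", index, tokens)
```

The two E4 notes in the first beat collapse into a single token, which is the sense in which all events at one pitch within one step share a token, and an empty beat would still appear as an empty entry so that position in the sequence maps directly to musical time.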


2605.07061 2026-06-02 cs.SD cs.AI cs.CV cs.MM Updated version

Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian

Affiliations: University of Texas at Dallas; University of Washington; University of California, Los Angeles

AI summary: Proposes AV-Phys Bench, a benchmark that tests physical commonsense in joint audio-video generation models, and finds that all models fall short on physical consistency, especially in event-driven and environment-driven transition scenes.

Comments Preprint. Project Page: https://zijuncui.com/AV-Phys/. Full abstract appears in the PDF

Abstract

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.


2604.18360 2026-06-02 cs.SD cs.CL Updated version

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang

Affiliations: Sogang University

AI summary: Proposes Omni-Embed-Audio (OEA), a retrieval encoder that uses multimodal LLMs with native audio understanding; with User-Intent Queries (UIQs) and hard negative mining, it matches M2D-CLAP on text-to-audio retrieval while clearly outperforming existing methods on text-to-text retrieval and hard negative discrimination.

Comments Accepted at ACL 2026 Main Conference. Camera-ready version

Abstract

Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs): five formulations reflecting natural search behaviors, namely questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3 percentage points HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.


2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC Updated version

Acoustic and perceptual differences between standard and accented speech and their voice clones

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

Affiliations: Department of Linguistics, University at Buffalo, United States; Department of Computer Science and Engineering, University at Buffalo, United States; Emeritus Faculty, Australian National University, Australia

AI summary: Through computational and perceptual experiments comparing standard and accented Mandarin speech with their voice clones, finds that accent affects perceived identity match and intelligibility: clones of standard-accent speakers are rated closer to the original voice, and clones of accented speakers show a larger intelligibility gain.

Abstract

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.


2510.00180 2026-06-02 eess.AS cs.SD eess.SP Updated version

DiffAU: Diffusion-Based Ambisonics Upscaling

Amit Milstein, Nir Shlezinger, Boaz Rafaely

Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes DiffAU, which uses a diffusion model adapted to spatial audio to generate third-order Ambisonics from first-order Ambisonics, enabling fast and reliable upscaling.

2602.02557 2026-06-02 cs.LG cs.AI cs.SD Updated version

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

Affiliations: University of Oxford

AI summary: Proposes and validates the "Alignment Curse" principle, that stronger text-audio modality alignment promotes the transfer of text attacks to audio; black-box experiments show that text-transferred audio attacks perform comparably to or better than native audio attacks, revealing a fundamental tension between capability and safety.

Comments 23 pages, 5 figures

Abstract

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.


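2601.06199 2026-06-02 eess.AS cs.AI cs.SD Updated version

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

Junseok Lee, Sangyong Lee, Chang-Jae Chun

Affiliations: OKESTRO; Sejong University

AI summary: To address token explosion with long-form speech input, proposes the FastSLM architecture, whose hierarchical temporal abstractor (HTA) reaches an extreme compression rate of 1.67 tokens per second (a 97% reduction); it substantially cuts computation and parameters while remaining competitive with state-of-the-art models on long-form speech benchmarks.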

Comments Title updated

2601.19919 2026-06-02 cs.CL cs.AI cs.SD Updated version

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Junseok Lee, Nahun Kim, Sangyong Lee, Chang-Jae Chun

Affiliations: OKESTRO Co., Ltd; Sejong University

AI summary: Proposes Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework that gradually reduces reliance on the teacher model and adds a self-knowledge distillation phase; when compressing Whisper it achieves a 5x inference speedup and a 1.07% lower word error rate.

Comments Title and content have been updated

Abstract

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In the context of Automatic Speech Recognition (ASR), previous studies have predominantly focused on forcing the student model to strictly mimic the predictive distribution of a massive teacher model. However, this static dependency often presents an inherent trade-off: while the student rapidly acquires basic linguistic representations, it simultaneously inherits the teacher's domain-specific blind spots and over-confident hallucinations, leading to a severe decline in out-of-distribution generalization capacity. To effectively mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the dependency on the teacher's distribution as training progresses, thereby unlocking the student's independent reasoning capacity, and subsequently employs a self-knowledge distillation phase to act as a structural regularizer. By applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In our comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model by yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state-of-the-art for generalizable model compression.
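One way to picture a decaying dependency on the teacher is a loss that mixes ground-truth cross-entropy with a teacher-matching KL term whose weight shrinks over training. The sketch below is an assumption-laden illustration, not the ASKD method itself: the linear schedule, the initial weight of 0.9, the toy logits, and all function names are hypothetical, and the subsequent self-distillation phase is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability of the correct class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def teacher_weight(step, total_steps, initial=0.9):
    """Assumed schedule: linearly decay the teacher weight to zero."""
    return initial * max(0.0, 1.0 - step / total_steps)


def distillation_loss(student_logits, teacher_logits, label, step, total_steps):
    """Blend the label loss with the teacher-matching loss at the current step."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    alpha = teacher_weight(step, total_steps)
    return (1 - alpha) * cross_entropy(student, label) + alpha * kl_divergence(teacher, student)


student_logits = [2.0, 0.5, -1.0]  # toy values
teacher_logits = [3.0, 0.2, -2.0]  # toy values
for step in (0, 50, 100):
    weight = teacher_weight(step, 100)
    loss = distillation_loss(student_logits, teacher_logits, 0, step, 100)
    print(step, weight, loss)
```

Early in training the student mostly imitates the teacher; by the final step the teacher term vanishes and only the label loss remains, which is the point at which a self-distillation phase could take over as a regularizer.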

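2601.03615 2026-06-02 cs.CL cs.SD eess.AS Updated version

SARA: Stress Test Reasoning in Audio Deepfake Detection

Binh Nguyen, Charles Fleming, Thai Le

Affiliations: Indiana University; Cisco Research

AI summary: Proposes SARA, a framework that assesses the reliability of audio language model reasoning under adversarial attacks along three dimensions: acoustic perception, reasoning-verdict coherence, and dissonance. Acoustic attacks reduce coherence, whereas linguistic attacks preserve coherence while achieving higher success rates, and the textual coherence of reasoning traces can serve as a latent indicator for detecting adversarial samples.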

Comments Preprint for ACL 2026 submission

Abstract

Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing transparency to their predictions via reasoning traces. However, such reasoning may not support the model predictions, reflecting poor coherence, or, worse, may rationalize incorrect predictions with plausible but misleading explanations. Moreover, the behavior of ALM reasoning under adversarial attacks remains under-explored, raising questions about the practical reliability of such explanation capabilities. To address this gap, this study introduces SARA (Shift Analysis of Reasoning in Audio), a diagnostic framework that evaluates ALM reasoning across three dimensions: acoustic perception, reasoning-verdict coherence, and dissonance. We test five open-source ALMs against both acoustic and linguistic adversarial attacks. We show that acoustic attacks significantly degrade reasoning-verdict coherence (average decrease of 14.20%), frequently inducing internal logical conflicts. Conversely, linguistic attacks achieve higher attack success rates while maintaining reasoning coherence. We further demonstrate that the textual coherence of generated reasoning traces also serves as a latent indicator of adversarial inputs, enabling effective detection of perturbed audio (0.78 in F1) without accessing the raw acoustic signal. These findings suggest that reasoning traces provide diagnostic utility that persists even when final classification outputs are compromised.

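2512.10120 2026-06-02 cs.SD cs.AI Updated version

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Proposes VocSim, a training-free, label-free benchmark that evaluates general-purpose audio representations on zero-shot content identity through the geometric alignment of frozen embeddings; it reports strong results across audio domains while exposing a cross-lingual generalization gap.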

Comments Accepted at ICML 2026. Code: https://github.com/vocsim/benchmark

Abstract

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings, with no parameters updated and no labels used (a label-free PCA whitening is fit per subset to correct anisotropy). VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds, isolating content representation from source separation (polyphonic mixtures are out of scope). We evaluate embeddings with Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation, calibrated by lift over an empirical permutation baseline. A simple pipeline of frozen Whisper features, time-frequency pooling, and label-free PCA yields strong zero-shot performance with stable GSR rankings across domains (Kendall's tau = 0.60). However, on blind low-resource speech (Shipibo-Conibo, Chintang), local retrieval collapses while remaining above chance, exposing a cross-lingual speech generalization gap. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art on the HEAR benchmark. We release data, code, and a public leaderboard.
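Two of the ingredients named in the abstract, label-free PCA whitening and Precision@k over frozen embeddings, can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's released code: the function names, the use of cosine similarity, and the `eps` regularizer are assumptions, and the Global Separation Rate is left out because the abstract does not define it.

```python
import numpy as np

def pca_whiten(x, eps=1e-8):
    """Label-free PCA whitening: center, rotate onto principal axes, scale to unit variance."""
    centered = x - x.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(x) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values from round-off
    return centered @ eigvecs / np.sqrt(eigvals + eps)

def precision_at_k(embeddings, labels, k=5):
    """Mean fraction of each clip's k nearest neighbours (cosine) that share its label."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a clip must not retrieve itself
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    hits = labels[neighbours] == labels[:, None]
    return hits.mean()
```

Labels appear only in `precision_at_k`, that is, at scoring time; `pca_whiten` never sees them, which matches the abstract's description of the whitening step as label-free.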

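2511.13487 2026-06-02 eess.AS cs.LG cs.SD Updated version

Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines

Affiliations: Taighde Éireann – Research Ireland

AI summary: Systematically evaluates how different combinations of time-frequency features affect binaural sound source localization, finding that carefully chosen combinations (such as channel spectrograms with ILD and IPD) can outperform increases in model complexity, and offers practical guidance for domain-specific and general-purpose localization.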

Comments Accepted at EUSIPCO 2026

Abstract

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
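The features named in the abstract have standard definitions: ILD is the level ratio between the left and right spectrograms in decibels, and IPD is the phase of the cross-spectrum. The sketch below computes them with a simple Hann-windowed STFT. The frame length, hop size, `eps`, and function names are illustrative assumptions and do not reflect the paper's configuration.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Hann-windowed short-time Fourier transform, returned as (freq_bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def binaural_features(left, right, eps=1e-8):
    """Per-channel magnitudes plus interaural level and phase differences."""
    l_spec, r_spec = stft(left), stft(right)
    mag_l, mag_r = np.abs(l_spec), np.abs(r_spec)
    ild = 20.0 * np.log10((mag_l + eps) / (mag_r + eps))  # dB, positive when left is louder
    ipd = np.angle(l_spec * np.conj(r_spec))              # radians in (-pi, pi]
    return {"mag_left": mag_l, "mag_right": mag_r, "ild": ild, "ipd": ipd}
```

Stacking `ild` and `ipd` along a new leading axis gives a two-feature input, and adding the two magnitude maps gives a richer four-channel input of the kind the abstract describes for out-of-domain generalization.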

2510.01891 2026-06-02 cs.SD cs.AI eess.AS Updated version

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

Affiliations: SONICOM

AI summary: To address the difficulty of measuring individual HRTFs, proposes a transformer-based HRTF upsampling architecture that combines the attention mechanism, spherical harmonic domain processing, and a neighbour dissimilarity loss to reconstruct high-fidelity HRTFs.

Comments Accepted to IEEE Transactions on Multimedia 2026

Abstract

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.
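A common objective spectral distortion metric for HRTF upsampling is the log-spectral distortion (LSD): the root-mean-square difference, in decibels, between reference and reconstructed magnitude responses. The sketch below shows that generic metric only; whether the paper uses exactly this formulation is an assumption, and the neighbour dissimilarity loss is not reproduced because the abstract does not specify it.

```python
import numpy as np

def log_spectral_distortion(h_ref, h_est, eps=1e-12):
    """RMS dB error over frequency bins, one value per source direction.

    h_ref, h_est: arrays of shape (directions, freq_bins), real or complex.
    """
    ratio_db = 20.0 * np.log10((np.abs(h_ref) + eps) / (np.abs(h_est) + eps))
    return np.sqrt(np.mean(ratio_db ** 2, axis=-1))

def mean_lsd(h_ref, h_est):
    """Average the per-direction LSD over all source directions."""
    return float(log_spectral_distortion(h_ref, h_est).mean())
```

A reconstruction that is uniformly 6 dB too quiet scores an LSD of about 6.02 dB in every direction, which makes the metric easy to sanity-check.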

2505.18614 2026-06-02 cs.CL cs.LG cs.MM cs.SD eess.AS Updated version

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

Affiliations: Yonsei University; Seoul National University

AI summary: Presents MAVL, the first multilingual, multimodal lyrics translation benchmark, and SylAVL-CoT, a syllable-constrained audio-video large language model that uses audio-video cues and syllable constraints to improve the singability and translation accuracy of lyrics.

Comments Accepted to EMNLP 2025, Project Page: https://k1064190.github.io/papers/paper1.html, our codes and datasets are available at https://github.com/k1064190/MAVL

2412.03771 2026-06-02 cs.SD cs.LG eess.AS Updated version

Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

Ysobel Sims, Alexandre Mendes, Stephan Chalup

Affiliations: School of Information and Physical Sciences, University of Newcastle, Australia

AI summary: Proposes a conditional generation method based on a diffusion model for zero-shot environmental sound classification, which on average outperforms existing baselines across multiple audio datasets.

Abstract

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduce a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.
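The data flow described in the abstract (generate embeddings for unseen classes from auxiliary data, pool them with real seen-class embeddings, then train a classifier) can be sketched as follows. This is a toy illustration only: `synthesize_embeddings` is a hypothetical Gaussian sampler standing in for the trained conditional diffusion model, the nearest-centroid classifier is a placeholder, and the sketch assumes the auxiliary vectors live in the same space as the audio embeddings.

```python
import numpy as np

def synthesize_embeddings(class_aux, n_per_class, noise=0.1, seed=0):
    """Placeholder generator (NOT a diffusion model): sample around each class's auxiliary vector."""
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for label, aux in class_aux.items():
        feats.append(aux + noise * rng.standard_normal((n_per_class, aux.shape[0])))
        labels.extend([label] * n_per_class)
    return np.vstack(feats), np.array(labels)

def fit_centroids(features, labels):
    """Nearest-centroid classifier: one mean embedding per class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(centroids, features):
    """Assign each embedding to the class with the closest centroid."""
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(features - centroids[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]
```

Training on `np.vstack([seen_x, syn_x])` with the concatenated labels lets the classifier assign test clips to classes for which no real audio was seen during training.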
