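arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.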

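2605.18749 2026-05-19 cs.SD cs.CV Updated version

WavFlow: Audio Generation in Waveform Space

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng

Affiliations: Meta AI; Northeastern University

AI summary: This paper proposes WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. Waveform patchify and amplitude lifting enable stable optimization, and an automated data pipeline produces high-quality video-text-audio triplets. The model performs competitively on video-to-audio and text-to-audio benchmarks, showing that high-quality synthesis does not require intermediate compression.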

Comments Code: https://github.com/facebookresearch/WavFlow

Abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
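
The waveform patchify step mentioned in the abstract can be pictured with a small sketch. The patch length, grid width, zero-padding, and the function names `patchify` and `unpatchify` are illustrative assumptions rather than details from the paper, and amplitude lifting is left out because the abstract does not specify it.

```python
def patchify(waveform, patch_len, grid_width):
    """Reshape a 1D waveform into a 2D grid of fixed-length patches (tokens).

    Hypothetical sketch: zero-pads the tail so the samples fill a whole
    number of grid rows. Returns grid[row][col] -> list of patch_len samples.
    """
    if patch_len <= 0 or grid_width <= 0:
        raise ValueError("patch_len and grid_width must be positive")
    samples_per_row = patch_len * grid_width
    n_rows = -(-len(waveform) // samples_per_row)  # ceiling division
    padded = list(waveform) + [0.0] * (n_rows * samples_per_row - len(waveform))
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(grid_width):
            start = (r * grid_width + c) * patch_len
            row.append(padded[start:start + patch_len])
        grid.append(row)
    return grid


def unpatchify(grid, length):
    """Invert patchify: flatten the grid and drop the zero padding."""
    flat = [s for row in grid for patch in row for s in patch]
    return flat[:length]
```

Because the reshape is lossless apart from padding, `unpatchify(patchify(x, p, w), len(x))` returns the original samples, which is the property that distinguishes this kind of tokenization from a learned latent compression.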

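2605.18613 2026-05-19 cs.SD cs.AI Updated version

SAME: A Semantically-Aligned Music Autoencoder

Julian D. Parker, Zach Evans, CJ Carr, Zachary Zukowski, Josiah Taylor, Matthew Rice, Jordi Pons

Affiliations: Stability AI

AI summary: This study proposes the SAME autoencoder, which combines a Transformer architecture with semantic regularization to reach a 4096x temporal compression ratio while maintaining reconstruction quality and generative performance.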

2605.18409 2026-05-19 cs.SD Updated version

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

Hengyan Huang, Xiaoxuan Guo, Jiayi Zhou, Yuankun Xie, Jian Liu, Haonan Cheng, Long Ye, Qin Zhang

Affiliations: State Key Lab. of Media Convergence and Communication, Communication University of China, Beijing, China; Key Lab. of Media Audio & Video, Ministry of Education, Communication University of China, Beijing, China

AI summary: This paper proposes EnvTriCascade, a framework that uses a three-stage cascade and environment-aware modeling to distinguish genuine speech from manipulated mixtures, achieving a high Macro-F1 score in the ESDD2 challenge.

Abstract

ADD in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior to distinguish original recordings from manipulated mixtures, which calibrates the final decisions. Next, two complementary five-class detectors, leveraging SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
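
The Macro-F1 score reported above averages per-class F1 with equal weight, so a rare class counts as much as a common one in a five-class task. The sketch below shows the standard definition; the function name and the zero-F1 convention for empty classes are assumptions for illustration, and no challenge data is involved.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a class with no true or predicted
        # instances contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```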

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph Updated version

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institut für Informationsverarbeitung, Leibniz Universität Hannover; Department of Radiology, Harvard Medical School and Massachusetts General Hospital

AI summary: This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. It exploits the correlation between vocal-tract configuration and the produced acoustics to predict part of the image content, reconstructing anatomically plausible structure at higher throughput.

Abstract

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
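
The fusion of an audio-driven and an MRI-driven component through a spatial weighting map can be sketched as a per-pixel blend. The convex-combination rule, the [0, 1] range of the weights, and every name below are assumptions made for illustration; the abstract does not state the exact fusion rule.

```python
def fuse_frame(audio_pred, mri_recon, weight_map):
    """Blend two 2D images pixel by pixel.

    Assumed rule: out = w * audio_pred + (1 - w) * mri_recon, with each
    weight w in [0, 1].
    """
    if not (len(audio_pred) == len(mri_recon) == len(weight_map)):
        raise ValueError("inputs must have the same number of rows")
    fused = []
    for a_row, m_row, w_row in zip(audio_pred, mri_recon, weight_map):
        if not (len(a_row) == len(m_row) == len(w_row)):
            raise ValueError("rows must have equal length")
        fused_row = []
        for a, m, w in zip(a_row, m_row, w_row):
            if not 0.0 <= w <= 1.0:
                raise ValueError("weights must lie in [0, 1]")
            fused_row.append(w * a + (1.0 - w) * m)
        fused.append(fused_row)
    return fused
```

Under this reading, a weight near 1 marks a pixel that is trusted to the audio branch (for example an articulator region), and a weight near 0 leaves the pixel to the k-space reconstruction.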

2605.18175 2026-05-19 cs.SD Updated version

Sonalyzer-Moz: A Framework for Analyzing the Structure of Mozart's Sonata Form

Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran, Kiki Adhinugraha, David Taniar

Affiliations: School of Information Technology, Monash University Malaysia; Department of Computer Science and Information Technology, La Trobe University; Faculty of Information Technology, Monash University

AI summary: This paper proposes Sonalyzer-Moz, a framework that combines feature aggregation with sequential modeling to analyze sonata form structure automatically, and builds SoSA-Moz, the first large-scale annotated dataset, laying a foundation for the systematic study of sonata form.

Comments 6 pages, 2 figures

Abstract

The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made considerable progress in recent years, sonata form analysis remains in its early stages. This is largely because annotating classical music structures is time-consuming and requires substantial musical background. To advance research in this area, we curated SoSA-Moz, the first large-scale dataset featuring comprehensive hierarchical structure annotations. This work establishes a foundation for systematic sonata form analysis. Leveraging this newly contributed resource, we further propose Sonalyzer-Moz, a baseline model specifically designed for investigating complex sonata structures. This framework integrates feature aggregation with sequential modeling, enabling it to capture both local features and upper-level structural dependencies. Experimental results show that Sonalyzer-Moz is capable of identifying the boundaries of the upper-level structural components that are critical to understanding sonata form. Therefore, this method demonstrates, for the first time, the effectiveness of automatic upper-level analysis of sonata form, and provides a robust baseline for future research in the automatic understanding of sonata form while advancing the study of classical music structure analysis.

2605.18168 2026-05-19 cs.CR cs.SD Updated version

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

Yanyun Wang, Yu Huang, Zi Liang, Xixin Wu, Li Liu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong Polytechnic University; The Chinese University of Hong Kong

AI summary: This paper proposes a new acoustic interference approach that uses acoustic latent semantics to mount universal jailbreaks against large audio language models, exposing the fragility of their safety alignment.

Comments 43rd International Conference on Machine Learning (ICML'26)

Abstract

The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a carrier for malicious payloads, relying on semantic optimization, acoustic parameter control, or additive perturbation to embed harmful content into the audio signal. In this work, we challenge this necessity and propose a new paradigm in which the role of audio shifts from content injection to safety alignment interference. We reveal that LALM safety alignment can be compromised solely by specific Acoustic Latent Semantics (ALS), the underlying paralinguistic features intrinsic to the priors of audio generative models. Distinct from previous works that leverage explicit acoustic parameters to merely style malicious audio, we demonstrate that interference audio, benign in content but infused with specific ALS, can serve as a universal jailbreak trigger. Leveraging this insight, we propose the Acoustic Interference Attack (AIA), which decouples the attack payload from the audio. Specifically, AIA employs a set of universal, instruction-neutral interference audio, enabling standard malicious text queries to bypass safety alignment without instance-specific optimization. Extensive experiments on 10 LALMs across five datasets demonstrate that AIA achieves the state-of-the-art attack success rate. Furthermore, our interpretability analysis uncovers the inference path drift induced by AIA and identifies the inherent effective patterns within ALS, revealing the fundamental vulnerability of cross-modal alignment in LALMs.

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS Updated version

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

Affiliations: NVIDIA; Kyoto University; Carnegie Mellon University

AI summary: This paper proposes Speech-Hands, a framework that uses a self-reflection decision mechanism to decide when the model should trust itself and when it should rely on external input in speech recognition and external sound understanding, improving accuracy and robustness in multi-task audio reasoning.

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT Updated version

Two-Dimensional Quantization for Geometry-Aware Audio Coding

Tal Shuster, Eliya Nachmani

Affiliations: School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel

AI summary: This paper proposes Q2D2, a two-dimensional quantization method that projects feature pairs onto structured 2D grids, improving audio compression efficiency while maintaining state-of-the-art reconstruction quality.

Comments Accepted to ICML 2026

Abstract

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization, while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive or superior performance on various objective and subjective reconstruction metrics compared to state-of-the-art models, across extensive experiments in the speech, audio, and music domains. Comprehensive ablation studies further confirm the effectiveness of our design choices.
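
The idea of quantizing a feature pair to the nearest point of a structured 2D grid, with an implicit codebook given by the product of grid levels, can be sketched for the rectangular case. The [-1, 1] bounds, the level counts, the index layout, and the name `quantize_pair` are assumptions for illustration; the hexagonal and rhombic tilings named in the abstract are not covered here.

```python
import math


def quantize_pair(pair, levels):
    """Snap a feature pair to the nearest point of a rectangular grid on [-1, 1]^2.

    levels = (lx, ly) gives the number of grid values per axis (each >= 2),
    so the implicit codebook holds lx * ly entries.
    Returns (quantized_pair, code_index).
    """
    quantized, indices = [], []
    for value, n in zip(pair, levels):
        if n < 2:
            raise ValueError("each axis needs at least 2 levels")
        clipped = max(-1.0, min(1.0, value))
        # Map [-1, 1] onto [0, n - 1] and round half up to the nearest level.
        idx = math.floor((clipped + 1.0) / 2.0 * (n - 1) + 0.5)
        indices.append(idx)
        quantized.append(idx / (n - 1) * 2.0 - 1.0)
    code = indices[0] * levels[1] + indices[1]
    return tuple(quantized), code
```

With `levels = (5, 3)` the codebook has 15 entries and none of them is stored explicitly: the code index is computed from the two per-axis indices, which is what makes the codebook implicit.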

2401.09512 2026-05-19 cs.SD eess.AS Updated version

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

Affiliations: Fraunhofer AISEC; Resemble AI; Wrocław University of Science and Technology; Thorsten-Voice

AI summary: This paper presents version 10 of the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which covers 175 text-to-speech (TTS) models and 1002.9 hours of synthetic speech in 54 languages for training and evaluating audio deepfake detection models, and reports strong performance across multiple datasets.

Comments IJCNN 2024

Abstract

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

2605.18072 2026-05-19 cs.SD Updated version

MusicDET: Zero-Shot AI-Generated Music Detection

Chaolei Han, Hongsong Wang, Jie Gui

Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China; School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Purple Mountain Laboratories, Nanjing 210000, China; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education, China

AI summary: This paper proposes MusicDET, a framework that uses frequency-guided normalizing flows to perform zero-shot detection of AI-generated music without any generated training samples, effectively identifying out-of-distribution music signals.

Comments Accepted by ICML 2026

Abstract

Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
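
The detection principle in the abstract, fitting a density model on real music only and flagging low-likelihood inputs, can be sketched as follows. A diagonal Gaussian stands in for the paper's frequency-guided normalizing flow purely to keep the example short; the feature vectors, the threshold, and all function names are hypothetical.

```python
import math


def fit_diag_gaussian(real_features, min_var=1e-6):
    """Estimate per-dimension mean and variance from real-music features only."""
    n = len(real_features)
    dim = len(real_features[0])
    means = [sum(x[d] for x in real_features) / n for d in range(dim)]
    variances = [
        max(sum((x[d] - means[d]) ** 2 for x in real_features) / n, min_var)
        for d in range(dim)
    ]
    return means, variances


def log_likelihood(x, means, variances):
    """Log density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def is_out_of_distribution(x, means, variances, threshold):
    """Flag a sample whose likelihood under the real-music model is low."""
    return log_likelihood(x, means, variances) < threshold
```

No generated sample is needed at any point: the threshold would be chosen from held-out real music, which is what makes the setting zero-shot with respect to generators.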

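2605.17991 2026-05-19 cs.SD cs.AI Updated version

Stable Audio 3

Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

AI summary: Stable Audio 3 is a family of fast latent diffusion models for variable-length audio generation and editing; efficient latent-space generation and adversarial post-training improve generation quality and speed.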

Comments Training code: https://github.com/Stability-AI/stable-audio-tools Inference and weights: http://github.com/Stability-AI/stable-audio-3

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and less than a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.

2605.17737 2026-05-19 cs.SD Updated version

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren

Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University

AI summary: This paper proposes PVP, a phoneme-based voice profiling framework that uses micro-phonetic modeling to capture speaker-specific pronunciation patterns, enabling efficient detection of speech deepfakes with fine-grained, phoneme-level interpretability.

Comments Accepted by IJCAI 2026

Abstract

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP

2605.17512 2026-05-19 eess.AS cs.SD Updated version

Robust Audio Tagging under Class-wise Supervision Unreliability

Yuanbo Hou, Zhaoyi Liu, Tong Ye, Qiaoqiao Ren, Jian Guan, Wenwu Wang, Stephen Roberts

Affiliations: Machine Learning Research Group, Engineering Science, University of Oxford; GISP, Harbin Engineering University; EECS, KTH Royal Institute of Technology; CVSSP, University of Surrey

AI summary: This paper addresses class-dependent supervision unreliability in audio tagging and proposes CSU, a framework that controls supervision strength at the class level during training, improving robustness across architectures and across sources of supervision unreliability.

Abstract

Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue becomes harder as real and generated audio are increasingly mixed in training, and generated samples do not always match their intended semantic labels. Prior work mainly addressed unreliable supervision from missing-positive labels, while this paper targets three other sources of unreliable supervision: spurious additions, misassignments between similar classes, and weakened label evidence. These effects introduce class-dependent optimisation bias that is not explicitly modeled by most existing methods. To bridge this gap, the paper proposes a Class-wise Supervision Unreliability (CSU) framework that controls supervision strength at the class level during training. CSU learns a separate unreliability parameter for each class and down-weights less reliable supervision without changing the model architecture or inference process. To support evaluations, this paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability. The results indicate that explicit class-wise modeling of supervision unreliability is an effective and practical strategy for robust audio tagging under large-scale weakly labeled training. Code and data are available at: https://github.com/Yuanbo2020/CSU
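
The class-level down-weighting described in the abstract can be illustrated with a multi-label loss in which each class carries its own unreliability value. The weighting form `(1 - u_c)`, the fixed (not learned) values, and the function name are assumptions; the abstract does not give the loss formula, so this is a sketch of the general idea rather than the CSU objective.

```python
import math


def classwise_weighted_bce(probs, labels, unreliability, eps=1e-7):
    """Binary cross-entropy summed over classes, scaled per class.

    Assumed weighting: class c contributes (1 - u_c) * BCE_c, where
    u_c in [0, 1] is that class's unreliability.
    """
    total = 0.0
    for p, y, u in zip(probs, labels, unreliability):
        if not 0.0 <= u <= 1.0:
            raise ValueError("unreliability must lie in [0, 1]")
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += (1.0 - u) * bce
    return total
```

Only the training loss changes in this sketch, which matches the abstract's statement that the model architecture and inference process are untouched. A version that learns the per-class values would presumably need some regularizer so that the weights do not all collapse to zero, which is omitted here.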

2605.17488 2026-05-19 cs.CV cs.MM cs.SD Updated version

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang

Affiliations: Shanghai Jiao Tong University; Zhejiang University; University of Electronic Science and Technology of China

AI summary: This paper proposes Omni-Customizer, an end-to-end multimodal customization framework for precise binding and seamless fusion of multimodal identity information; it introduces an Omni-Context Fusion module and a Masked TTS Cross-Attention mechanism to improve multimodal customized generation.

Abstract

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

2605.17405 2026-05-19 cs.SD cs.MM Updated version

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

Weixing Wei, Raynaldi Lalang, Dichucheng Li, Kazuyoshi Yoshii

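Affiliations: Graduate School of Informatics, Kyoto University, Japan; Graduate School of Engineering, Kyoto University, Japan; Independent Researcher, Hong Kong, China

AI summary: This paper treats automatic piano transcription as an optimal transport problem rather than as frame-level multi-label binary classification. Minimizing the cost of transporting the predicted note distribution to the ground-truth distribution improves the perceptual relevance of temporal alignment. The paper also proposes a convolutional recurrent neural network with a harmonic-aware attention mechanism to capture spectro-temporal dependencies in music.

Comments Accepted to ICASSP 2026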

2601.03170 2026-05-19 cs.SD Updated version

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

Qifan Liang, Yuansen Liu, Ruixin Wei, Nan Lu, Junchuan Zhao, Ye Wang

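Affiliations: School of Computing, National University of Singapore

AI summary: This paper proposes TED-TTS, a training-free controllable framework that lets pretrained zero-shot TTS models express intra-utterance emotion and duration. It combines a segment-aware emotion conditioning strategy and a segment-aware duration steering strategy with causal masking and monotonic stream alignment filtering to obtain smooth emotion shifts while preserving global semantic coherence, and it uses a large multi-emotion and duration-annotated dataset for automatic prompt construction. Experiments show state-of-the-art intra-utterance consistency in multi-emotion and duration control while preserving the speech quality of the underlying TTS model.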

Comments 24 pages, 9 figures, 7 tables, 3 lists

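AI abstract

Although controllable text-to-speech (TTS) has made notable progress, most existing methods remain limited to inter-utterance-level control; their reliance on non-public datasets or complex multi-stage training makes fine-grained intra-utterance expression challenging. In this paper we propose TED-TTS, a training-free controllable framework that enables pretrained zero-shot TTS to express intra-utterance emotion and duration. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, producing smooth intra-utterance emotion shifts while preserving global semantic coherence. Building on this, we further propose a segment-aware duration steering strategy that combines local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To remove the need for manual segment-level prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset that enables LLM-based automatic prompt construction. Extensive experiments show that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control but also preserves the speech quality of the underlying TTS model. Code and audio samples are available.

English abstract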

While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset to enable LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains baseline-level speech quality of the underlying TTS model. Code and audio samples are available.

2512.23994 2026-05-19 cs.SD cs.AI Updated version

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

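Affiliations: HKUST(GZ); Tencent

AI summary: This paper proposes PhyAVBench, a benchmark that evaluates the audio-physics grounding capabilities of text-to-audio-video, image-to-audio-video, and video-to-audio generation models. Using a new dataset and evaluation method, it shows that current models fall short in generating physically plausible audio.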

Comments 6 major physical dimensions, 41 fine-grained test points, 337 groups of variable-controlled test samples, 11,605 newly recorded videos

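AI abstract

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks focus mainly on audio-video temporal synchronization and neglect explicit evaluation of audio-physics grounding, which limits research on physically plausible audio-video generation. To address this, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, I2AV, and V2A models. PhyAVBench provides PhyAV-Sound-11K, a new dataset of 11,605 audible videos totaling 25.5 hours, collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences; each group is grounded with 17 videos on average, and together they span 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors behind its acoustic differences. Importantly, PhyAVBench uses paired text prompts to evaluate this capability. We call this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a new metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results show that even leading commercial models struggle with basic audio-physical phenomena, revealing a critical gap beyond audio-video synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-video generation. Prompts, ground truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

English abstract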

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

2605.17181 2026-05-19 cs.SD cs.AI Updated version

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

Abhimanyu Kaushik

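Affiliations: Independent Researcher; Trophy Club, Texas

AI summary: This work proposes an automated pipeline that uses optical music recognition to turn sheet music into violin fingerboard animations. It integrates three open-source tools and adds a custom lookup table that maps musical notes to violin strings and finger positions.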

Comments 12 pages, 4 figures

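AI abstract

Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings, so a beginner cannot tell by looking where each finger should go. MusicSynth is an open-source web tool that aims to solve this problem: the user uploads a photo of any violin sheet music (or a digital score file), and the system automatically generates a video of a violin fingerboard with each note highlighted, with no software to install and no manual note entry. The system chains three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested on 110 public-domain violin scores, MusicSynth correctly identified 91.2% of notes in clean printed music and assigned the correct finger position 99.1% of the time when given a digital score file. To the author's knowledge, no other free tool currently converts a sheet music image into an animated violin fingerboard tutorial automatically.

English abstract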

Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings at all, so a beginner cannot tell by looking where to place each finger. MusicSynth is an open-source web tool that tries to fix that: a user uploads a photo of any violin sheet music (or a digital score file), and the system automatically produces a video showing a violin fingerboard with each note highlighted at the right moment, with no software to install and no manual note entry required. The system connects three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested across 110 public-domain violin scores, MusicSynth correctly identified 91.2% of notes in clean printed music and assigned the right finger position 99.1% of the time when given a digital score file. To the author's knowledge, no freely available tool currently turns a sheet music image into an animated violin fingerboard tutorial automatically and in a single browser-based step.
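The note-to-fingering lookup table mentioned in the abstract can be illustrated with a minimal sketch. This is not MusicSynth's actual table: it assumes standard violin tuning (G3, D4, A4, E5), first position only, MIDI note numbers as keys, and a simplified semitone-to-finger rule; the names `OPEN_STRINGS`, `SEMITONE_TO_FINGER`, `build_lookup`, and `fingering_for` are hypothetical.

```python
# Hypothetical first-position lookup table: MIDI note number -> (string, finger).
# Assumptions (not from the paper): standard tuning, first position only,
# and the open string is preferred over the fourth finger on the string below.

OPEN_STRINGS = {"G": 55, "D": 62, "A": 69, "E": 76}  # MIDI numbers of the open strings

# Semitones above the open string -> finger number (0 means open string).
SEMITONE_TO_FINGER = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4}


def build_lookup():
    """Build the MIDI -> (string name, finger) table for first position."""
    table = {}
    # Walk the strings from lowest to highest so that a note playable both as
    # a fourth finger and as the next open string ends up on the open string.
    for name, open_midi in sorted(OPEN_STRINGS.items(), key=lambda kv: kv[1]):
        for offset, finger in SEMITONE_TO_FINGER.items():
            table[open_midi + offset] = (name, finger)
    return table


FINGERBOARD = build_lookup()


def fingering_for(midi_note):
    """Return (string, finger) for a MIDI note, or None if out of range."""
    return FINGERBOARD.get(midi_note)
```

With this table, `fingering_for(71)` (B4) maps to the first finger on the A string, while notes above first position on the E string return `None`; a fuller table would add higher positions and key-dependent finger choices.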

2605.17085 2026-05-19 cs.SD cs.LG eess.AS Updated version

Taming Audio VAEs via Target-KL Regularization

Prem Seetharaman, Rithesh Kumar

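Affiliations: Adobe Research

AI summary: This paper trains audio VAEs at specific compression rates via target-KL regularization in order to balance over-regularization against under-regularization of VAE latents in audio generation, and it constructs rate-distortion curves for audio VAEs.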

Comments Accepted at ICASSP 2026 (Barcelona, Spain, 3-8 May 2026). 5 pages, 1 figure, 3 tables

DOI: 10.1109/ICASSP55912.2026.11460662
Journal ref: Proc. ICASSP 2026

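AI abstract

Latent diffusion models have become the dominant paradigm for many generation tasks, including audio generation such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low-frame-rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging because of the trade-off between over-regularized latent representations (poor output quality) and under-regularized ones (difficult to predict). We propose a framework for studying this trade-off through compression and train audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison with well-studied discrete neural audio codec models and the construction of rate-distortion curves for audio VAEs. We evaluate the effect of target-KL regularization on text-to-sound generation and find that sweeping compression rates helps identify the best generation setting.

English abstract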

Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low frame rate continuous representation that is conducive for downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
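The link between a KL term and a bitrate can be made concrete with a small sketch. The abstract does not give the paper's loss, so everything here is an assumption: a diagonal-Gaussian posterior against a standard-normal prior, the per-frame KL read as a rate in nats, and an absolute-difference penalty toward the target. The function names are hypothetical.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) in nats for one latent frame."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def target_kl_nats(bitrate_bps, frame_rate_hz):
    """Per-frame KL target (nats) that corresponds to a bitrate in bits/second.

    bits per frame = bitrate / frame rate; one bit equals ln(2) nats.
    """
    return bitrate_bps / frame_rate_hz * math.log(2)


def target_kl_penalty(kl_nats, target_nats):
    """Assumed regularizer: distance of the measured KL from the target."""
    return abs(kl_nats - target_nats)


# Example: a 1 kbps target at 50 latent frames per second.
target = target_kl_nats(1000.0, 50.0)
penalty = target_kl_penalty(kl_diag_gaussian([1.0, -1.0], [0.0, 0.0]), target)
```

Unlike a fixed KL weight, a penalty of this shape pushes the latent rate toward a chosen value from either side, which is what makes a sweep over bitrates (and hence a rate-distortion curve) possible.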

2605.16878 2026-05-19 cs.SD Updated version

Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations

Yuyang Yan, Sami O. Simons, Visara Urovi

Affiliations: Institute of Data Science, Maastricht University; NUTRIM Research Institute of Nutrition and Translational Research in Metabolism, Faculty of Health Medicine and Life Sciences, Maastricht University; Maastricht University Medical Centre

AI summary: This paper proposes an adversarial learning architecture that disentangles pathology-related speech features from speaker identity attributes, improving both the detection of asthma and COPD exacerbations and patient privacy. SHAP analysis quantifies how speech features contribute to pathology-related predictions.

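AI abstract

Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive monitoring of respiratory disease. However, speech signals inherently carry speaker-identifiable attributes that may dominate model predictions, which can harm both diagnostic performance and patient privacy. In addition, the acoustic features associated with respiratory disease and with speaker identity remain unclear in respiratory disease monitoring. We propose an adversarial learning architecture that disentangles pathology-related acoustic patterns from speaker-identifiable attributes. The framework optimizes two clinically hierarchical tasks: (i) respiratory status classification (stable vs. exacerbated) and (ii) exacerbation type classification (asthma exacerbation vs. COPD exacerbation). Speaker identity is suppressed through adversarial training based on gradient reversal. To improve clinical interpretability, we use SHapley Additive exPlanations (SHAP) to quantify the contributions of acoustic features to pathology-related predictions versus speaker identity. On the TACTICAS dataset, our method outperforms the single-task baseline on both tasks. For the respiratory status task, the AUC improves from 0.897 to 0.910. For the exacerbation type task, the AUC improves from 0.674 to 0.793. At the same time, the J-ratio decreases, confirming effective suppression of speaker information. SHAP analysis reveals the contributions of the acoustic features to both tasks. External validation on the Bridge2AI-Voice dataset further shows consistent performance improvement and reduced speaker dependency, confirming cross-dataset generalizability.

English abstract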

Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-identifiable attributes that may dominate model predictions, which may compromise both diagnosis performance and patient privacy. Furthermore, the acoustic features associated with respiratory disease and speaker identity remain unclear in respiratory disease monitoring. We propose an adversarial learning architecture that disentangles pathology-related acoustic patterns from speaker-identifiable attributes. The framework optimizes two clinically hierarchical tasks: (i) respiratory status classification (stable vs. exacerbated) and (ii) exacerbation type classification (asthma exacerbation vs. COPD exacerbation). Speaker identity is suppressed through gradient reversal-based adversarial training. To enhance clinical interpretability, we employ SHapley Additive exPlanations (SHAP) to quantify the contributions of acoustic features to pathology-related predictions versus speaker identity. On the TACTICAS dataset, our method outperforms the single-task baseline across both tasks. For the respiratory status task (stable vs. exacerbated), the AUC improves from 0.897 to 0.910. For the exacerbation type task (asthma exacerbation vs. COPD exacerbation), the AUC increases from 0.674 to 0.793. Concurrently, the J-ratio decreases, confirming effective suppression of speaker information. SHAP analysis reveals the contributions of the acoustic features to both tasks. External validation on the Bridge2AI-Voice dataset further demonstrates consistent performance improvement and reduced speaker dependency, confirming cross-dataset generalizability.

2605.16717 2026-05-19 physics.geo-ph cs.SD Updated version

Radial-Component Predominant-Mode Inversion of Rayleigh Waves: Application to DAS-based Site Characterization

Mrinal Bhaumik, Brady R. Cox

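Affiliations: Department of Civil and Environmental Engineering, Utah State University

AI summary: This paper proposes a radial-component predominant-mode inversion framework for DAS-based surface-wave analysis. By accounting for source-receiver directivity and the modal sensitivity of the Rayleigh-wave radial component, it gives a component-consistent interpretation of radial-component dispersion data and thereby improves the accuracy of shear wave velocity profiles.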

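AI abstract

Distributed Acoustic Sensing (DAS) has become a transformative technology for near-surface site characterization. When a vertical source is activated along the fiber, DAS measures only the radial (in-line) component of Rayleigh-wave motion. Dispersion data extracted from radial-component waveforms may differ from those obtained from vertical-component measurements, especially under complex stratigraphic conditions. A component-consistent forward problem is therefore needed when inverting radial-component DAS dispersion data to obtain accurate shear wave velocity (Vs) profiles. This study presents a radial-component predominant-mode (RCPM) inversion framework for DAS-based surface-wave analysis that explicitly accounts for source-receiver directivity and the modal sensitivity of the Rayleigh-wave radial component. The proposed method matches the measured dominant radial dispersion trend with the theoretical mode that has the maximum modal participation. As a result, the RCPM framework removes the need for explicit modal indexing, provides a component-consistent interpretation of radial-component dispersion data, and substantially reduces reliance on subjective, analyst-driven modal interpretation. The RCPM method is evaluated systematically using three synthetic ground models and two field DAS datasets. The synthetic results show that, in the presence of strong velocity contrasts and velocity reversals, the modal energy distribution differs significantly between the vertical and radial components, and that conventional inversion approaches may misinterpret modal behavior, producing less accurate Vs profiles. In contrast, the RCPM method consistently captures the correct modal response and yields reliable Vs profiles. Application to the two field DAS datasets further shows good agreement between the inverted Vs profiles and independent invasive borehole measurements.

English abstract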

Distributed Acoustic Sensing (DAS) has emerged as a transformative technology for near-surface site characterization. When a vertical source is activated along the fiber, DAS measures only the in-line (radial) component of Rayleigh-wave motion. Dispersion data extracted from radial-component waveforms may differ from those obtained from vertical-component measurements, particularly under complex stratigraphic conditions. Hence, a component-consistent forward problem is desired when inverting radial-component DAS dispersion data to retrieve accurate shear wave velocity (Vs) profiles. This study presents a radial-component predominant-mode (RCPM) inversion framework designed for DAS-based surface-wave analysis that explicitly accounts for source-receiver directivity and modal sensitivity of the Rayleigh-wave radial component. The proposed approach matches measured dominant radial dispersion trends with the theoretical mode exhibiting the maximum modal participation. As a result, the RCPM framework eliminates the need for explicit modal indexing, provides a component-consistent interpretation of radial-component dispersion data, and substantially reduces reliance on subjective analyst-driven modal interpretations. The RCPM approach is systematically evaluated using three synthetic ground models and two field DAS datasets. The synthetic results demonstrate that modal energy distribution differs significantly between vertical and radial components in the presence of strong velocity contrasts and velocity reversals, and that conventional inversion approaches may misinterpret modal behavior, resulting in less accurate Vs profiles. In contrast, the RCPM method consistently captures the correct modal response and yields reliable Vs profiles. Application to two field DAS datasets further demonstrates good agreement between the inverted Vs profiles and independent invasive borehole measurements.

2605.01235 2026-05-19 cs.SD cs.AI Updated version

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

Yimeng Zhang, Yueru Sun, Haoyu Gu, Zhanpeng Jin

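Affiliations: South China University of Technology

AI summary: This paper proposes MindMelody, a system that generates personalized music in real time from EEG signals. It combines a Transformer-GNN with a RAG-equipped LLM to close the loop between emotion sensing and music generation, improving affective adaptation and user engagement.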

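AI abstract

In response to the growing global burden of mental health conditions, music-based interventions have attracted wide attention as a non-invasive, cost-effective means of emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and cannot adapt to users' momentary psychological states. Moreover, mapping electroencephalography (EEG) directly to music generation remains challenging because paired data are scarce and interpretability is lacking. We therefore propose MindMelody, a complete closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Large Language Model (LLM) equipped with Retrieval-Augmented Generation (RAG) to formulate structured intervention plans. Next, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system includes a continuous feedback loop that updates generation parameters in real time according to the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive, affect-aware music generation framework.

English abstract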

Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.

2509.22061 2026-05-19 eess.AS cs.CL cs.SD Updated version

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

Shree Harsha Bokkahalli Satish, Harm Lameris, Olivier Perrotin, Gustav Eje Henter, Éva Székely

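AI summary: This paper presents the first systematic evaluation of bias in the speech continuation task, examining how gender and phonation type affect continuation behaviour. It finds significant model and gender interactions and a stronger reversion toward modal phonation for female prompts, revealing a voice-quality bias.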

Comments 8 pages, 2 figures, Accepted to Identity-Aware AI LREC Workshop 2026

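AI abstract

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving semantic context and speaker identity. Because SC is confined to a single audio stream, it probes bias in speech foundation models more directly than dialogue does. This paper presents the first systematic evaluation of bias in SC, studying how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. Three recent models, SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT, are evaluated on speaker similarity, voice quality preservation, and text-based bias metrics. The results show that, although speaker similarity and coherence remain challenging, textual evaluation reveals significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects appear in text metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

English abstract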

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

2505.14066 2026-05-19 eess.AS cs.SD Updated version

SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement

Kuan-Yu Chen, Jeng-Lin Li, De-Yan Lu, Jian-Jiun Ding

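AI summary: This paper proposes the SeamlessEdit framework, which uses a frequency-band-aware noise suppression module and an in-context optimization strategy to handle cases where the frequency bands of speech and background noise are not separable. It outperforms existing methods across multiple evaluation dimensions.

Comments 5 pages, 3 figures, accepted to EUSIPCO 2026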

2404.00470 2026-05-19 cs.SD cs.LG eess.AS Updated version

Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network

Md Hassanuzzaman, Nurul Akhtar Hasan, Mohammad Abdullah Al Mamun, Khawza I Ahmed, Ahsan H Khandoker, Raqibul Mostafa

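AI summary: This paper investigates the minimum signal duration needed for automatic heart sound classification, using a Transformer-based residual one-dimensional convolutional neural network on MFCC features. A 5 s signal gives 93.69% accuracy, whereas a 3 s signal carries too little information and a 15 s signal contains more noise.

Comments 16 pages, 11 figures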

DOI: 10.1109/ACCESS.2025.3573870
Journal ref: IEEE Access, vol. 13, pp. 93852-93868, 2025

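AI abstract

Congenital heart diseases (CHDs) are congenital anomalies caused by structural defects of the heart and great vessels. A PCG can provide important information about the mechanical conduction system of the heart and reveal specific patterns associated with different CHD types. This study investigates the minimum signal duration required for automatic heart sound classification. It also examines the optimal values of two signal quality assessment indicators, RMSSD and ZCR. A Transformer-based residual one-dimensional convolutional neural network built on MFCC features is used to classify heart sounds. The study shows that 0.4 is the ideal threshold for the RMSSD and ZCR indicators when selecting suitable signals. Moreover, 5 s is the minimum signal length needed for effective heart sound classification. The study also shows that shorter signals (3 s heart sounds) cannot be classified accurately, whereas longer signals (15 s heart sounds) may contain more noise. The 5 s signal achieves the best accuracy, 93.69%, in distinguishing heart sounds.

English abstract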

Congenital anomalies arising as a result of a defect in the structure of the heart and great vessels are known as congenital heart diseases or CHDs. A PCG can provide essential details about the mechanical conduction system of the heart and point out specific patterns linked to different kinds of CHD. This study aims to investigate the minimum signal duration required for the automatic classification of heart sounds. It also investigates the optimum values of two signal quality assessment indicators, the Root Mean Square of Successive Differences (RMSSD) and the Zero Crossing Rate (ZCR). Mel-frequency cepstral coefficient (MFCC) based features are used as input to a Transformer-based residual one-dimensional convolutional neural network, which then classifies the heart sound. The study shows that 0.4 is the ideal threshold on the RMSSD and ZCR indicators for obtaining suitable signals. Moreover, a minimum signal length of 5 s is required for effective heart sound classification. A shorter signal (3 s heart sound) does not have enough information to categorize heart sounds accurately, and a longer signal (15 s heart sound) may contain more noise. The best accuracy, 93.69%, is obtained with the 5 s signal.
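The two signal quality indicators named in the abstract have standard definitions that can be sketched in a few lines. The abstract does not say how the indicators are applied, so this sketch makes assumptions: both are computed on amplitude-normalized samples of one segment, and a segment counts as suitable when both values are at or below the 0.4 threshold. The helper `is_suitable` and the direction of the comparison are hypothetical.

```python
import math


def rmssd(samples):
    """Root mean square of successive differences of a sample sequence."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def is_suitable(samples, threshold=0.4):
    """Assumed quality gate: keep a segment when both indicators are small."""
    return rmssd(samples) <= threshold and zero_crossing_rate(samples) <= threshold
```

Both functions need at least two samples. A smooth, slowly varying segment passes the gate, while a segment that flips sign on every sample has a zero crossing rate of 1 and is rejected.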

2605.16539 2026-05-19 cs.SD physics.data-an Updated version

vega-mir: An information-theoretic Python toolkit for symbolic music, with applications to harmonic graphs and rubato spectra

Fred Jalbert-Desforges

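AI summary: vega-mir analyzes symbolic music data with information-theoretic and statistical metrics. The paper presents two new case studies, validates four further metrics, and relates rubato periodicity to individual performers.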

Comments 20 pages, 2 figures, companion to arXiv:2605.06685

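AI abstract

We present vega-mir, an open-source Python library that bundles nine information-theoretic and statistical metrics for analyzing symbolic music corpora behind a tested, citable API. The paper demonstrates two of these metrics at corpus scale in case studies that the upstream Cygnus paper does not cover. Of the nine metrics, three (Shannon entropy, KL divergence, Zipfian fits) were deployed in the companion Cygnus arXiv preprint; two (network analysis on chord-transition graphs and spectral analysis of rubato curves) are demonstrated in full here; and the remaining four (multi-dimensional Gini, chi-squared stationarity, Higuchi fractal dimension, interval distribution) are validated against analytic anchors on a bundled 8-composer dataset. The two case studies yield two main observations. First, among the fourteen MAESTRO composers with N >= 10 pieces, the PageRank value of the gravity-centre node correlates with the marginal KL distance at rho = 0.61 (Spearman, composer-level jackknife, N = 14); the categorical gravity-centre identity takes five distinct values across the corpus but is not itself correlated with marginal KL (rho = 0.13, p = 0.21). Second, in the 247-piece Bach multi-master corpus (Schiff, Gould, Richter), Gould has the highest periodicity ratio rather than the lowest, overturning the cliché that low scalar rubato reads as "metronomic": Gould's rubato is small in amplitude but structured in time, with a median dominant period of 66 beats against 102 for Schiff and 104 for Richter.

English abstract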

We present vega-mir, an open-source Python library that bundles nine information-theoretic and statistical metrics for the analysis of symbolic music corpora behind a small, tested, citable API, and demonstrates two of them at corpus scale in case studies not addressed by the upstream Cygnus paper. Of the nine metrics, three (Shannon entropy, Kullback-Leibler divergence, Zipfian fits) were deployed in the companion Cygnus arXiv preprint; two (network analysis on chord-transition graphs and spectral analysis of rubato curves) are deployed in full case studies here; the four remaining (multi-dimensional Gini, chi-squared stationarity, Higuchi fractal dimension, interval distribution) are validated against analytic anchors and exercised as sanity checks on a bundled 8-composer dataset. The two case studies yield two main observations. First, on the fourteen MAESTRO composers with N >= 10 pieces, the PageRank value of the gravity-centre node correlates with the marginal Kullback-Leibler distance at rho = 0.61 (Spearman, composer-level jackknife N = 14); the categorical gravity-centre identity takes five distinct values across the corpus but is not itself correlated with marginal KL (rho = 0.13, p = 0.21). Second, on the 247-piece Bach multi-master corpus (Schiff, Gould, Richter), Gould holds the highest periodicity ratio of the three performers, not the lowest, inverting the cliché that low scalar rubato reads as "metronomic": Gould's rubato is small in amplitude but structured in time, with a median dominant period of 66 beats against Schiff's 102 and Richter's 104.
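Two of the metrics listed in the abstract, Shannon entropy and Kullback-Leibler divergence, can be sketched with the standard library alone. This is not vega-mir's API, which the abstract does not show; the function names, the use of bits as the unit, and the `eps` floor for symbols missing from the reference distribution are all assumptions made for illustration.

```python
import math
from collections import Counter


def distribution(symbols):
    """Empirical probability of each symbol (e.g. pitch classes or chords)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def shannon_entropy(p):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; symbols absent from q are floored at eps."""
    return sum(
        pv * math.log2(pv / max(q.get(s, 0.0), eps))
        for s, pv in p.items()
        if pv > 0
    )


# Example: compare the pitch-class usage of two short hypothetical excerpts.
excerpt_a = distribution(["C", "E", "G", "C", "E", "G", "C", "G"])
excerpt_b = distribution(["C", "D", "E", "F", "G", "A", "B", "C"])
entropy_a = shannon_entropy(excerpt_a)
distance_ab = kl_divergence(excerpt_a, excerpt_b)
```

Because KL divergence is asymmetric, `kl_divergence(excerpt_a, excerpt_b)` and `kl_divergence(excerpt_b, excerpt_a)` generally differ, which matters when one corpus is treated as the reference.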

2605.16403 2026-05-19 cs.CV cs.SD Updated version

When Vision Speaks for Sound

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi

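AI summary: This paper finds that the audio understanding of video-capable MLLMs relies on visual cues rather than the actual audio stream. It introduces the Thud framework, which studies the problem through three audio-editing interventions, and proposes a two-stage alignment recipe that improves model performance.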

Comments 24 pages, 10 figures

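AI abstract

Despite the rapid progress of video-capable MLLMs, we find that their audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate audio information instead of verifying the audio stream. The problem appears in both state-of-the-art open-source omni models and leading closed-source models. We call this failure mode an audio-visual Clever Hans effect: models appear (falsely) audio-grounded but actually exploit visual-acoustic correlations without verifying that the audio and visual streams are truly aligned. To study this behavior systematically, we introduce Thud, an intervention-driven probing framework built on three counterfactual audio edits: Shift tests temporal synchronization, Mute tests sound existence, and Swap tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points while slightly improving performance on general video and audio-visual QA benchmarks.

English abstract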

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
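The three counterfactual audio edits can be pictured as simple operations on a clip's sample array. The abstract does not describe how Thud implements them, so the sketch below is only one plausible reading: audio is a plain list of float samples, Shift delays or advances the track with zero padding, Mute silences it, and Swap substitutes audio from another clip at the same length. The function names are hypothetical.

```python
def shift(samples, offset):
    """Delay (offset > 0) or advance (offset < 0) the audio, keeping its length."""
    n = len(samples)
    if offset >= 0:
        k = min(offset, n)
        return [0.0] * k + samples[: n - k]
    k = min(-offset, n)
    return samples[k:] + [0.0] * k


def mute(samples):
    """Replace the audio with silence of the same length."""
    return [0.0] * len(samples)


def swap(samples, other):
    """Replace the audio with another clip's audio, trimmed or zero-padded."""
    n = len(samples)
    return other[:n] + [0.0] * max(n - len(other), 0)


# Example: build the three edited versions of one short hypothetical track.
track = [0.1, 0.4, -0.2, 0.3]
unrelated = [0.9, -0.9]
variants = {
    "shift": shift(track, 1),
    "mute": mute(track),
    "swap": swap(track, unrelated),
}
```

A model that truly listens should change its answer about timing, sound presence, or audio-visual match when given the edited track with the original frames; a model that answers identically is relying on the video alone.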

2605.16364 2026-05-19 cs.SD cs.AI cs.CL Updated version

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

Zien Sheikh Ali, Hamdy Mubarak, Soon-Gyo Jung, Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury

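AI summary: This paper presents the WASIL dataset of in-the-wild Arabic spoken interactions, with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback, for evaluating LLMs in real-world settings.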

Comments Spoken Prompts, Multilingual LLMs, Speech-based Evaluation, Dialectal Speech, Low-resource Languages, Conversational AI, Speech-to-Text QA, Real-world Interaction, Spoken Language Understanding

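AI abstract

Voice assistants based on large language models (LLMs) are usually built as a cascade of automatic speech recognition (ASR) followed by an LLM, in which recognition errors can distort user intent. Dislikes may also stem from ambiguous, out-of-domain, or non-request turns, which makes it hard to isolate the effect of ASR. We release WASIL (the word denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts through post-editing guided by multi-ASR agreement, and we annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe a scalable reference-free evaluation that uses multi-judge LLM scoring to compare responses obtained from ASR transcripts with those obtained from gold transcripts.

English abstract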

Large Language Model (LLM) voice assistants are commonly built as cascaded Automatic Speech Recognition (ASR)-to-LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (the name denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.
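One way to picture agreement-guided post-editing is to score how closely several ASR hypotheses for the same turn match and to send only low-agreement turns to a human editor. The abstract does not specify WASIL's protocol, so this is a sketch under assumptions: agreement is one minus a length-normalized word-level edit distance, averaged over all hypothesis pairs, and the 0.8 threshold is arbitrary. All names are hypothetical.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def agreement(hypotheses):
    """Mean pairwise similarity in [0, 1]; needs at least two hypotheses."""
    tokens = [h.split() for h in hypotheses]
    scores = []
    for x, y in combinations(tokens, 2):
        denom = max(len(x), len(y), 1)
        scores.append(1.0 - word_edit_distance(x, y) / denom)
    return sum(scores) / len(scores)


def needs_post_editing(hypotheses, threshold=0.8):
    """Assumed routing rule: low agreement means a human should check the turn."""
    return agreement(hypotheses) < threshold
```

Turns on which all recognizers agree can be accepted cheaply, which is what keeps the cost of gold transcripts low; the threshold trades annotation effort against residual transcript errors.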

2507.05688 2026-05-19 eess.AS cs.SD Updated version

Robust One-step Speech Enhancement via Consistency Distillation

Liang Xu, Longfei Felix Yan, W. Bastiaan Kleijn

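AI summary: This paper proposes ROSE-CD, which uses consistency distillation to obtain a robust one-step model, improving both the real-time applicability and the performance of speech enhancement. Experiments show better speech quality than conventional multi-step diffusion models.

Comments Accepted to IEEE WASPAA 2025. 6 pages, 1 figure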

Journal ref: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Lake Tahoe, CA, USA, October 2025

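AI abstract

Diffusion models perform strongly in speech enhancement, but multi-step iterative sampling limits their real-time use. Consistency distillation offers an alternative by distilling a one-step consistency model from a multi-step diffusion model. However, distilled consistency models are biased toward the teacher model's sampling trajectory, which makes them less robust to noise and prone to inheriting the teacher's inaccuracies. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise, and we jointly optimize the one-step model with two time-domain auxiliary losses so that it can recover from teacher-induced errors and surpass the teacher in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement; it achieves 54 times faster inference than its 30-step teacher model together with better performance. Experiments on the VoiceBank-DEMAND dataset show that the proposed model achieves state-of-the-art speech quality. Its generalization ability is also validated on an out-of-domain dataset and on real-world noisy recordings.

English abstract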

Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.

2408.17352 2026-05-19 cs.SD cs.AI eess.AS version update

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

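AI summary: This paper proposes AASIST3, which extends the existing AASIST framework with Kolmogorov-Arnold networks (KAN) and other techniques, substantially improving speech deepfake detection and reaching a minDCF of 0.5357 in the closed condition.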

Comments 8 pages, 2 figures, 2 tables. Accepted paper at the ASVspoof 2024 (the 25th Interspeech Conference)

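AI abstract (translated from Chinese)

Automatic speaker verification (ASV) systems identify speakers from their voice characteristics and are widely used for user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, advances in deep learning have made it possible to generate synthetic audio with text-to-speech (TTS) and voice conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counter this, the authors propose a new architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It reaches a minDCF of 0.5357 in the closed condition and 0.1414 in the open condition, significantly improving the detection of synthetic voices and strengthening ASV security. The new version of the model is publicly available on HuggingFace (2026).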

English abstract

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security. The new version of the model is publicly available on HuggingFace (2026): https://huggingface.co/lab260/Spectra-AASIST3
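
The abstract lists pre-emphasis among the additions to AASIST but does not say how it is configured. As a generic illustration only, a first-order pre-emphasis filter is commonly written as y[n] = x[n] - alpha * x[n-1], which boosts high frequencies relative to low ones. The sketch below assumes that textbook form; the function name `pre_emphasis` and the default coefficient 0.97 are illustrative assumptions and are not taken from AASIST3.

```python
def pre_emphasis(samples, alpha=0.97):
    """Textbook first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    `alpha` is a hypothetical default; the abstract does not state
    which coefficient (or which filter form) AASIST3 actually uses.
    """
    if not samples:
        return []
    # The first sample has no predecessor, so it is passed through unchanged.
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) signal is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0], alpha=0.5))
```

With `alpha=0.5`, the constant input is halved after the first sample, which shows the low-frequency attenuation this kind of filter is meant to provide.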
