arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21874 2026-05-25 cs.SD Version update

Real-time, EDM-inspired sonification of the activity of a supercomputer

Marco Alunno, Paolo Bientinesi

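Affiliations: High-Performance Computing Center North, Umeå University

AI summary: This paper studies how to render the real-time operating data of a supercomputer as informative sound. It proposes a sonification approach in the style of electronic dance music (EDM) that reflects the activity of each node continuously, clearly, and engagingly. The approach targets real-time monitoring rather than debugging and generates endless, stylistically uniform music, combining data sonification with the demands of long-term listening.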

Comments 7 pages, 2 figures, accepted conference paper

Abstract

The project described in this paper explores the informative sonification of data received in real time from a supercomputer. These data capture the current activities in all the nodes of the computer; therefore, their sonification functions as a form of continuous monitoring of the nodes' behavior and, by extension, of the system as a whole. Because such monitoring is theoretically unending, the resulting sonification must be musically capable of conveying information through sound in a way that remains both intelligible and engaging over long durations. Rather than imposing a predefined musical style onto the data, we sought to identify one which the data themselves could plausibly support. From a small set of candidates, we selected EDM because it is a family of genres whose structural and temporal characteristics align well with continuous, data-driven processes and long-term listening. Through this style-based approach, this research builds on the long tradition of computer data sonification while uniquely combining three elements rarely addressed together: monitoring (rather than debugging) as the primary goal, real-time (rather than post-mortem) data interpretation, and generation of virtually infinite and stylistically coherent (rather than incongruous) music.

2605.23619 2026-05-25 eess.AS cs.SD Version update

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

Kazushi Nakazawa

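Affiliations: Advanced Media, Inc.

AI summary: This paper studies intelligibility prediction for hearing-aid-processed speech without a reference signal and proposes a frame-aligned fusion of two pretrained speech encoders, Canary and WavLM. Comparing several fusion strategies, the author finds that passing WavLM through a learnable strided convolution and fusing it on the coarser Canary timeline improves prediction, reaching an Eval RMSE of 24.96 and an Eval correlation of 0.796. The analyses indicate that establishing coarse temporal correspondence before pooling helps the model capture the cues that matter for intelligibility.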

Comments 7 pages, 2 figures

Abstract

Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared left/right-preserving binaural framework. Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96 ± 0.06 and Eval Corr 0.796 ± 0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
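
The idea of aligning a finer timeline to a coarser one before pooling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's model: the function names, the channel-shared kernel taps, the stride, and per-frame concatenation as the fusion operator are all hypothetical, and in a real system the taps would be learned.

```python
def strided_conv(frames, taps, stride):
    """Downsample a frame sequence with a 1-D strided convolution over time.

    frames: list of feature vectors on the fine timeline.
    taps: kernel weights over time, shared across channels (a simplification;
          hypothetical stand-in for a learnable kernel).
    """
    k = len(taps)
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        out.append([sum(w * f[d] for w, f in zip(taps, window))
                    for d in range(dim)])
    return out


def fuse_then_pool(coarse, fine, taps, stride):
    """Align `fine` to the coarse timeline, concatenate per frame, then mean-pool."""
    aligned = strided_conv(fine, taps, stride)
    n = min(len(coarse), len(aligned))  # guard against off-by-one lengths
    fused = [coarse[t] + aligned[t] for t in range(n)]  # list concatenation
    dim = len(fused[0])
    return [sum(frame[d] for frame in fused) / n for d in range(dim)]
```

The point of the ordering is that frames are paired while they still carry local timing; pooling each encoder first would discard that correspondence.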

2605.23604 2026-05-25 eess.AS cs.SD Version update

Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

Kazushi Nakazawa

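Affiliations: Advanced Media, Inc.

AI summary: This paper studies text-assisted prediction of speech intelligibility for listeners with hearing loss and proposes word-level modeling with alignment-aware acoustic fusion. A frozen Whisper encoder analyzes the degraded speech, a decoder conditioned on the canonical transcript produces the predictions, and a word-aligned local acoustic branch plus a global acoustic branch for calibration improve performance. Experiments show gains over the decoder baseline on several metrics, supporting fine-grained prediction and alignment-aware fusion.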

Comments 7 pages, 2 figures

Abstract

We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging predicted correctness probabilities over valid reference words. To complement transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves to incorrect-word F1 0.778, MCC 0.626, correlation 0.806, and RMSE 24.39. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.
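
The aggregation step in the abstract, averaging predicted correctness probabilities over valid reference words, can be sketched as follows. This is an illustration only: the function name, the `valid` mask, and the 0 to 100 scaling are assumptions, not details confirmed by the paper.

```python
def sentence_intelligibility(word_probs, valid):
    """Average predicted word-correctness probabilities over valid reference words.

    word_probs: per-word probabilities in [0, 1] (hypothetical model outputs).
    valid: booleans marking which reference words count toward the score.
    Returns a percentage in [0, 100]; the scaling is an assumption.
    """
    kept = [p for p, v in zip(word_probs, valid) if v]
    if not kept:
        raise ValueError("no valid reference words")
    return 100.0 * sum(kept) / len(kept)
```

For example, probabilities 1.0 and 0.5 on two valid words, with a third word masked out, give a sentence score of 75.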

2605.23416 2026-05-25 cs.CL cs.SD Version update

Articulatory strategy as a source of variation in acoustic vowel dynamics

Patrycja Strycharczuk, Justin J. H. Lo, Sam Kirkham

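Affiliations: Linguistics and English Language, University of Manchester, United Kingdom; Linguistics and English Language, Lancaster University, United Kingdom

AI summary: This study examines how articulatory strategy affects acoustic vowel dynamics, relating individual articulatory habits to the shape of formant transitions. Analyzing ultrasound tongue imaging data from 36 speakers of Northern-Anglo English, it finds that tongue shape in the vowel /i/ is a significant predictor of formant dynamics in diphthongs with a palatal offglide. Greater displacement of the tongue root and dorsum leads to earlier and steeper formant transitions, which offers a new perspective on individual differences in speech.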

Journal ref Journal of the Acoustical Society of America (2026) 159(5): 4068-4078

Abstract

Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.

2605.23373 2026-05-25 cs.SD Version update

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

Zhaoyang Meng, Zhengyao Ma, Kecan Mao, Yingming Gao, Ya Li

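Affiliations: Beijing University of Posts and Telecommunications

AI summary: AffectCodec is a neural speech codec designed to preserve emotional information. To address the tendency of existing codecs to lose emotional cues during quantization, it proposes a structured quantization method based on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By separating the emotion and acoustic subspaces, the method controls bit allocation explicitly so that emotional information is still preserved at low bitrates. Experiments on several emotional speech datasets show substantially better emotion preservation with good reconstruction quality.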

Abstract

Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity, which leaves emotion-relevant cues vulnerable to being discarded during quantization, limiting the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under limited bitrate and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite nominally emotion-reserved dimensions. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over emotion and acoustic subspaces, BD-RFSQ transforms bit allocation from implicit and loss-driven to explicit and structurally guaranteed, while still preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
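
The structural guarantee behind a block-diagonal projection can be shown with a toy example. This is a simplified sketch, not the AffectCodec implementation: it uses a single quantization stage with an odd number of levels, omits the residual stages and all training machinery, and every name and weight below is hypothetical.

```python
import math


def fsq(values, levels=5):
    """Finite scalar quantization: bound each value with tanh, then round
    to one of `levels` evenly spaced points in [-1, 1] (odd `levels` assumed)."""
    half = (levels - 1) / 2
    return [round(math.tanh(v) * half) / half for v in values]


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def block_diagonal_quantize(emotion, acoustic, w_emotion, w_acoustic, levels=5):
    """Project each subspace with its own matrix (the two diagonal blocks),
    quantize, and return one flat code vector.

    Because the off-diagonal blocks are zero, the emotion codes depend only on
    the emotion features and the acoustic codes only on the acoustic features.
    """
    emotion_codes = fsq(matvec(w_emotion, emotion), levels)
    acoustic_codes = fsq(matvec(w_acoustic, acoustic), levels)
    return emotion_codes + acoustic_codes
```

Changing the acoustic input leaves the first block of codes untouched, which is the sense in which the reserved capacity is guaranteed by structure and not by the loss.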

2605.23293 2026-05-25 eess.AS cs.SD eess.SP Version update

Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier

Martynas Dumpis, Tuomas Virtanen

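Affiliations: Department of Electronic Systems, Vilnius Gediminas Technical University; Signal Processing Research Centre, Tampere University

AI summary: This paper evaluates whether integrated gradients (IG), a gradient-based attribution method, can detect the temporal boundaries of sound events in an audio classifier trained without temporal supervision. Using synthetic polyphonic audio with ground-truth timestamps, the study finds that IG localizes sound events with performance close to that of models that explicitly produce frame-level predictions, and well above random and energy-based baselines. The results indicate that IG captures meaningful temporal activity patterns of sound events, which is relevant to interpretability research on audio classifiers.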

Comments 5 pages, 3 figures

Abstract

Gradient-based attribution methods can highlight input regions important for neural network predictions, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can temporally detect sound events when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground truth timestamps to measure alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves mean Intersection over Union (IoU) of 0.39, frame-level F1 of 0.52, and Pointing Game accuracy of 82.6%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9% PG. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with localization performance approaching models that explicitly produce frame-level predictions. All methods substantially outperform random and energy-based baselines.
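
The three metrics named in the abstract can be sketched on binary frame sequences as follows. This is a generic illustration, not the paper's evaluation code: how attributions are thresholded into active frames and how scores are averaged across events and classes are not stated here, and the function names are hypothetical.

```python
def frame_metrics(pred, truth):
    """Frame-level IoU and F1 between two equal-length binary activity sequences."""
    if len(pred) != len(truth):
        raise ValueError("sequences must have equal length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return iou, f1


def pointing_game_hit(attribution, truth):
    """True if the frame with the largest attribution lies inside a ground-truth event."""
    peak = max(range(len(attribution)), key=lambda i: attribution[i])
    return bool(truth[peak])
```

Pointing Game accuracy is then the fraction of test items for which `pointing_game_hit` is true, which explains why it can be high even when IoU is moderate: it only checks the single strongest frame.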

2605.23261 2026-05-25 eess.AS cs.SD Version update

UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment

Yuanyuan Wang, Dongchao Yang, Yayue Deng, Zhiyong Wu, Yiwen Guo, Helen Meng, Xixin Wu

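Affiliations: The Chinese University of Hong Kong; Tsinghua University; Independent Researcher

AI summary: Evaluation of speech generation still relies mainly on human ratings such as Mean Opinion Score (MOS), which are costly, subjective, and hard to reproduce at scale. This paper proposes UniSRM, a unified speech reward model that provides multi-dimensional, interpretable evaluation signals grounded in reasoning. With the UniSRM-Data and UniSRM-Bench datasets, the work covers evaluation tasks from utterance-level quality to context-level coherence, and it introduces a reasoning-consistency reward that improves the reliability of the judgments and their alignment with human ratings.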

Comments Accepted by ACL 2026 (Main)

Abstract

Evaluating speech generation still relies heavily on human judgments, such as Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. While a few recent studies have begun to explore AudioLLM-based judge models, existing efforts typically target only a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and provide limited coverage of diverse speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that can support multi-dimensional, interpretable reward signals with reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, covering speech evaluation tasks from utterance-level quality to context-level coherence. Based on this dataset, we present the unified speech reward model, UniSRM, with a two-stage pipeline that enables reasoning-based fine-grained assessment. Furthermore, we introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable and unified evaluation of speech quality.

2605.23201 2026-05-25 cs.SD cs.MM Version update

MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

Qingcao Li, Yipeng Lin, Weichen Lian, Zhongjie Ba, Peng Cheng, Zhichao Lian

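Affiliations: School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China

AI summary: This paper presents MixFake, a large-scale benchmark dataset for evaluating and improving audio deepfake detection, designed to simulate real-world speech mixed with background music or noise. To address the weakness of existing self-supervised learning (SSL) methods on non-speech or mixed-source audio, the authors propose a multi-stream prompt tuning framework that injects signal-level priors so that the SSL model better captures acoustic artifacts. Experiments show that the method clearly outperforms existing baselines in both foreground detection and complex background detection.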

Comments Accepted by ICME2026

Abstract

Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
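
Mixing speech with background audio at a chosen SNR, the basic operation behind a benchmark with "varying SNR levels", can be sketched as follows. This is a generic textbook formulation, not the MixFake generation pipeline (see the repository above for that); the function name and the use of mean power over the whole clip are assumptions.

```python
import math


def mix_at_snr(speech, background, snr_db):
    """Scale `background` so the speech-to-background power ratio equals
    `snr_db` decibels, then add it to `speech` sample by sample."""
    if len(speech) != len(background):
        raise ValueError("signals must have equal length")
    p_speech = sum(x * x for x in speech) / len(speech)
    p_background = sum(x * x for x in background) / len(background)
    if p_background == 0.0:
        raise ValueError("background is silent")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_background)), solved for gain.
    gain = math.sqrt(p_speech / (p_background * 10.0 ** (snr_db / 10.0)))
    return [s + gain * b for s, b in zip(speech, background)]
```

At 0 dB the scaled background carries the same power as the speech; each additional 10 dB of SNR reduces the background power by a factor of ten.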

2509.15808 2026-05-25 cs.SD eess.AS Version update

From Independence to Interaction: Speaker-Aware Simulation of Multi-Speaker Conversational Timing

Máté Gedeon, Péter Mihajlik

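Affiliations: Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics; Speechtex Ltd.

AI summary: This paper proposes a speaker-aware method for simulating the timing of multi-speaker conversations that captures temporal consistency and realistic turn-taking dynamics. Speaker-specific deviation distributions enforce per-speaker temporal consistency, a Markov chain governs turn-taking, and a fixed room impulse response preserves spatial realism. Experiments show that the method outperforms the baseline on several intrinsic metrics and reflects the temporal dependencies and speaker alternation patterns of real conversations more accurately.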

Comments Submitted to ICASSP 2026

Abstract

We present a speaker-aware approach for simulating multi-speaker conversations that captures temporal consistency and realistic turn-taking dynamics. Prior work typically models aggregate conversational statistics under an independence assumption across speakers and turns. In contrast, our method uses speaker-specific deviation distributions enforcing intra-speaker temporal consistency, while a Markov chain governs turn-taking and a fixed room impulse response preserves spatial realism. We also unify pauses and overlaps into a single gap distribution, modeled with kernel density estimation for smooth continuity. Evaluation on Switchboard using intrinsic metrics - global gap statistics, correlations between consecutive gaps, copula-based higher-order dependencies, turn-taking entropy, and gap survival functions - shows that speaker-aware simulation better aligns with real conversational patterns than the baseline method, capturing fine-grained temporal dependencies and realistic speaker alternation, while revealing open challenges in modeling long-range conversational structure.
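
Two of the ingredients above, a Markov chain over speakers and a single gap distribution smoothed with kernel density estimation, can be sketched together. This is a minimal illustration, not the authors' simulator: it leaves out the speaker-specific deviation distributions and the room impulse response, and the function names, bandwidth, and sign convention (negative gap for overlap, positive for pause) are assumptions.

```python
import random


def sample_gap(observed_gaps, bandwidth, rng):
    """Draw one gap in seconds from a Gaussian KDE over observed gaps.

    Sampling from a Gaussian KDE amounts to picking an observed value
    uniformly and adding Gaussian noise with the kernel bandwidth.
    """
    center = rng.choice(observed_gaps)
    return rng.gauss(center, bandwidth)


def simulate_turns(transition, observed_gaps, n_turns, bandwidth=0.1, seed=0):
    """Return a list of (speaker, gap_before_turn) pairs.

    transition[i][j] is the probability that speaker j takes the turn
    after speaker i. The first turn has no preceding gap.
    """
    rng = random.Random(seed)
    speakers = list(range(len(transition)))
    current = rng.choice(speakers)
    turns = [(current, 0.0)]
    for _ in range(n_turns - 1):
        current = rng.choices(speakers, weights=transition[current])[0]
        turns.append((current, sample_gap(observed_gaps, bandwidth, rng)))
    return turns
```

Treating pauses and overlaps as one signed quantity means a single smooth density covers both, with no separate decision about which kind of transition occurs.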

2502.04230 2026-05-25 cs.SD cs.AI cs.CR cs.LG eess.AS Version update

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

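Affiliations: Department of Computer Science, Lehigh University, Bethlehem, PA, USA; Dolby Laboratories Inc., San Francisco, CA, USA

AI summary: With generative audio synthesis and editing advancing rapidly, copyright protection, data provenance, and the spread of deepfake audio have become pressing concerns. This paper proposes XAttnMark, a robust audio watermarking method based on cross-attention that jointly optimizes watermark detection and attribution through partial parameter sharing between the generator and the detector, an efficient cross-attention mechanism for message retrieval, and a temporal conditioning module. It also introduces a psychoacoustic-aligned time-frequency masking loss that improves imperceptibility, and experiments show strong robustness across a wide range of audio transformations.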

Comments Accepted at ICML'25

Abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

2605.20519 2026-05-25 cs.SD cs.AI Version update

Codec-Robust Attacks on Audio LLMs

Jaechul Roh, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

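Affiliations: University of Massachusetts Amherst; Qualcomm

AI summary: This paper studies codec-robust attacks on Audio Large Language Models (Audio LLMs) and proposes an attack called CodecAttack. The method optimizes perturbations in the continuous latent space of a neural audio codec instead of modifying the waveform directly, so the perturbation survives compression that filters out waveform-domain perturbations. Experiments show that CodecAttack achieves far higher attack success rates than waveform-domain attacks under several realistic compression scenarios, indicating that lossy compression is not a reliable defense against adversarial audio.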

Abstract

Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. We show that the codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.
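
A per-band energy check of the kind mentioned above can be approximated by measuring how much of a perturbation's spectral energy lies below a cutoff. This is a rough sketch, not the paper's analysis: only the 4 kHz figure comes from the abstract, while the function name, the direct DFT (slow, but dependency-free), and the single two-band split are assumptions.

```python
import cmath
import math


def energy_fraction_below(signal, sample_rate, cutoff_hz):
    """Fraction of spectral energy below `cutoff_hz`, using a direct one-sided DFT."""
    n = len(signal)
    total = 0.0
    below = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        # Interior bins stand for a conjugate pair, so they count twice;
        # the DC bin and (for even n) the Nyquist bin count once.
        weight = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        energy = weight * abs(coeff) ** 2
        total += energy
        if k * sample_rate / n < cutoff_hz:
            below += energy
    return below / total if total else 0.0
```

Applied to the difference between an adversarial clip and its clean original, a value near 1 would indicate a perturbation concentrated in the low band, and a value near 0 one that sits mostly in the higher frequencies.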