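arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems science.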

2606.05161 2026-06-04 cs.SD cs.CL Updated version

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

超越文本跟随：音频-语言模型中的可修复仲裁反转

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

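Affiliations: Northeastern University, China; Shanghai Artificial Intelligence Laboratory, China

AI summary: Using same-audio counterfactual experiments, the paper finds that audio-language models in conflict tasks often follow the text and ignore audio evidence, and proposes GACL, a training-free decoding rule that interpolates between joint and same-audio scores to repair arbitration reversals, improving nAUC by 17.8 points over the best contrastive baseline under a 5 pp faithfulness-drop budget.

AI Chinese abstract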

音频-语言模型（ALMs）常常遵循与音频冲突的文本，即使音频证据清晰。这引发了一个基本问题：音频支持的答案是不可用的，还是被表示出来但被冲突文本覆盖了？我们使用一个同音频反事实来研究这个问题，该反事实保持音频固定，仅移除冲突文本，并测量模型偏好由此产生的变化。在五个ALM和四个冲突任务中，64.1%的冲突样本显示出符号翻转：同音频分支偏好音频支持的答案，而联合分支偏好文本支持的答案。这种模式表明，相关的音频证据被编码但在仲裁中失败。激活修补进一步将反转定位到答案位置计算，并且修补效果与输出候选分数差异紧密相关（Spearman rho=0.93）。利用这一诊断，我们提出了门控音频反事实逻辑校正（GACL），一种无训练解码规则，在联合分数和同音频分数之间进行插值。在严格的5个百分点的忠实度下降预算下，GACL在最佳对比基线上将nAUC提高了17.8个点，并且无需重新调整即可迁移到视觉-文本仲裁（最高+40.5个百分点）。

English abstract

Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
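
The abstract describes GACL only as a gated, training-free rule that interpolates between joint and same-audio scores. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gacl_scores`, the mixing weight `alpha`, the `margin` parameter, and the specific gate (branches disagree on the top candidate and the same-audio branch is confident) are all hypothetical choices made for illustration.

```python
import numpy as np

def gacl_scores(joint_logits, same_audio_logits, alpha=0.5, margin=0.0):
    """Illustrative gated interpolation between two candidate-score vectors.

    joint_logits: scores with audio and conflicting text both present.
    same_audio_logits: scores with the same audio and the conflicting text removed.
    alpha, margin: hypothetical hyperparameters (not taken from the paper).
    """
    joint = np.asarray(joint_logits, dtype=float)
    audio = np.asarray(same_audio_logits, dtype=float)
    # Assumed gate: only correct when the two branches pick different winners
    # (a candidate "sign flip") and the same-audio winner leads by >= margin.
    disagree = joint.argmax() != audio.argmax()
    top_two = np.sort(audio)[-2:]  # requires at least two candidates
    confident = (top_two[1] - top_two[0]) >= margin
    if disagree and confident:
        return (1.0 - alpha) * joint + alpha * audio
    return joint

# Joint branch prefers candidate 0 (text-supported); same-audio prefers candidate 1.
corrected = gacl_scores([2.0, 1.0], [0.0, 3.0], alpha=0.5)
print(corrected, int(np.argmax(corrected)))
```

When the gate stays closed the joint scores are returned unchanged, which is one simple way to limit the faithfulness drop on samples that have no conflict.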

2606.05121 2026-06-04 cs.SD cs.AI cs.CL cs.MM eess.AS Updated version

Audio Interaction Model

音频交互模型

Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

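Affiliations: NTU (Nanyang Technological University); NUS (National University of Singapore); CUHK (The Chinese University of Hong Kong)

AI summary: Proposes Audio-Interaction, a unified online large audio language model that achieves real-time audio interaction through an always-on perceive-decide-respond loop, and builds the StreamAudio-2M dataset and the Proactive-Sound-Bench benchmark; it keeps competitive performance on mainstream audio tasks while unlocking real-time ASR, streaming audio instruction following, and proactive help.

Comments Next generation of LALMs, work in progress

AI Chinese abstract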

音频本质上是一种交互式模态，然而当今的大型音频语言模型（LALM）是离线的，而流式音频模型每个只处理单一任务，如流式ASR或语音聊天。现在是时候将它们统一为一个在线LALM：一个通过始终在线的感知-决策-响应循环，实时收听声音、环境和指令并即时反应的模型。我们将这种机制形式化为音频交互模型，并通过Audio-Interaction实现，这是一个统一的流式模型，在保留离线任务执行的同时，增加了在线通用音频指令跟随能力，从对话到全语音聊天，根据流语义决定何时响应。为此，我们提出了SoundFlow框架，该框架通过流原生数据构建、理解感知训练和异步低延迟推理，端到端地实例化感知-决策-响应循环，实现稳定的实时交互。我们进一步构建了StreamAudio-2M，一个包含260万项流式语料库，涵盖7种基本能力和28个子任务，以及用于评估主动音频干预的Proactive-Sound-Bench。在8个基准测试中，Audio-Interaction在主流音频任务上保持有竞争力的性能，同时解锁了离线LALM无法实现的能力，包括实时ASR、流式音频指令跟随和主动帮助。

English abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

2606.05101 2026-06-04 cs.SD cs.LG Updated version

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

FoeGlass: 简单的上下文学习足以对音频深度伪造检测器进行红队测试

Sepehr Dehdashtian, Jacob H Seidman, Vishnu N Boddeti, Gaurav Bharaj

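Affiliations: University of California, Berkeley

AI summary: Proposes FoeGlass, a black-box automated red-teaming method based on LLM in-context learning that generates audio samples to uncover blind spots in deepfake detectors, raising the detectors' false negative rate by up to 94%.

Comments Accepted at ICML 2026

AI Chinese abstract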

音频深度伪造检测（ADD）模型对于对抗文本转语音（TTS）模型的恶意使用至关重要。评估和增强ADD模型需要开发覆盖生成音频空间并突出高错误区域的数据集。现有数据集开发策略面临两个挑战：（i）手动收集，以及（ii）低效发现ADD模型中的盲点。为应对这些挑战，我们提出FoeGlass，这是首个针对ADD的黑盒自动红队方法，能有效发现最先进深度伪造基准未充分探索的生成音频空间中的ADD失败模式。FoeGlass利用大语言模型的上下文学习能力探索TTS模型的输入空间，仅通过黑盒访问所有组件即可生成欺骗目标ADD的音频样本。通过使用基于多样性度量精心设计的上下文，FoeGlass缓解了自动红队系统中常见的模式崩溃问题。在多个开源ADD和TTS模型上的实证评估表明，与无条件采样基线和最近的欺骗数据集相比，FoeGlass生成的数据将假阴性率大幅提升高达94%，且无需人工监督。此外，我们证明FoeGlass生成的攻击在不同目标ADD之间具有可迁移性，展示了其在ADD系统自动红队中的广泛适用性和易用性。最后，在FoeGlass生成的样本上微调ADD模型显著增强了检测器的鲁棒性（提升高达41%）。

English abstract

Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated by FoeGlass raises the detectors' false negative rates by up to 94% relative to unconditional sampling baselines and recent spoofing datasets, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (up to 41%).

2606.04921 2026-06-04 cs.SD eess.AS Updated version

SURF: Separation via Unsupervised Remixing Flow

SURF: 通过无监督重混流进行分离

Henry Li, Robin Scheibler, Efthymios Tzinis, Matt Shannon, Arnaud Doucet, John R. Hershey

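Affiliations: University of Toronto

AI summary: Proposes SURF, an unsupervised flow matching method that learns source separation directly from mixtures; it combines supervised flow matching with regression-based self-supervision, uses a remixing step to bootstrap a student model from a teacher's estimates, and sets a new state of the art on image and audio benchmarks.

Comments Accepted at ICML 2026

AI Chinese abstract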

单通道源分离的目标是从混合信号中重建$K$个源。在监督设置中，当有大量干净源数据可用时，这个具有挑战性的不适定问题已通过生成扩散和基于流的先验模型成功解决。然而，获取此类干净源样本通常受限，即使可用，监督模型也容易受到领域偏移的影响。为弥补这一差距，我们提出了通过无监督重混流进行分离（SURF），这是一种无监督流匹配方法，直接从观测到的混合信号中学习。该方法依赖于最先进的监督流匹配和基于回归的自监督技术的新颖组合。在高层面上，从教师模型开始，我们利用“重混”步骤，从教师估计中引导学习学生流模型。我们提供了关于该方法优化目标的见解，并建立了与Wake-Sleep算法的新联系。在图像和音频基准上的实证评估表明，SURF建立了新的最优水平，显著优于现有无监督方法。示例请参见我们的演示页面：https://google.github.io/df-conformer/surf/

English abstract

The goal of single-channel source separation is to reconstruct $K$ sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a "remixing" step to bootstrap the learning of a student flow model from the teacher's estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods. See our demo page for examples. https://google.github.io/df-conformer/surf/

2606.04844 2026-06-04 cs.SD cs.CV Updated version

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

漂移增强评分：文本驱动的零样本音频-语言分类噪声鲁棒性

Tu Vo, Sheir Zaheer, Chan Y. Park

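Affiliations: Anonymous Authors

AI summary: Proposes Drift Augmented Scoring (DAS), which uses text-derived noise-conditioned prompts to predict the drift direction of audio embeddings and adds a per-class bonus to the cosine score; with no gradients and no test-time batch, it improves zero-shot audio classification under noise (+2.60 to +5.75 accuracy points on UrbanSound8K, +1.50 to +1.74 mAP points on FSD50K).

AI Chinese abstract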

对比音频-语言模型（如CLAP）能够实现零样本音频分类：通过将音频嵌入与文本提示嵌入匹配来标记声音，无需标注音频。但在声学噪声下，这种匹配会失效，标准基准测试中，0 dB SNR时准确率和mAP下降12-30个百分点。我们提出漂移增强评分（DAS），这是一种添加到余弦评分中的每类小奖励。当噪声音频嵌入向该类噪声条件文本提示预测的方向漂移时，奖励该类。该奖励仅从文本推导，计算一次并缓存，推理时每类只需一个内积，无需梯度或测试时批处理。在LAION CLAP骨干网络上，我们将DAS与Acevedo等人同期方法的四种变体在UrbanSound8K和完整FSD50K评估集上进行比较，将每个片段与城市声学场景噪声混合，覆盖一系列SNR。DAS在所有测试条件下均提升了指标：UrbanSound8K上准确率提高+2.60至+5.75个百分点，FSD50K上mAP提高+1.50至+1.74个百分点。

English abstract

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
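
The scoring rule described above (cosine score plus a cached, text-derived per-class bonus computed with one inner product) can be sketched as follows. This is an illustration under assumptions, not the paper's exact formulation: the drift direction is taken here as the normalized difference between the mean noise-conditioned prompt embedding and the clean prompt embedding, and the names `drift_directions`, `das_scores`, and the weight `lam` are hypothetical.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def drift_directions(clean_text_emb, noisy_text_emb):
    """Per-class drift direction, computed once from text embeddings and cached.

    clean_text_emb: (C, D) embeddings of the plain class prompts.
    noisy_text_emb: (C, P, D) embeddings of P noise-conditioned prompts per class.
    Assumes the mean noisy embedding differs from the clean one (nonzero drift).
    """
    return unit(unit(noisy_text_emb).mean(axis=1) - unit(clean_text_emb))

def das_scores(audio_emb, clean_text_emb, drift, lam=0.1):
    """Cosine score plus lam times one inner product with the cached drift."""
    a = unit(audio_emb)
    cosine = unit(clean_text_emb) @ a
    bonus = drift @ a
    return cosine + lam * bonus

# Toy 2-class, 2-dimensional example with one noise-conditioned prompt per class.
clean = np.array([[1.0, 0.0], [0.0, 1.0]])
noisy = np.array([[[1.0, 1.0]], [[1.0, 1.0]]])
drift = drift_directions(clean, noisy)
print(das_scores(np.array([0.6, 0.8]), clean, drift, lam=0.1))
```

Because `drift` depends only on text, it can be precomputed per class, which matches the abstract's claim that inference adds a single inner product per class.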

2606.04680 2026-06-04 eess.AS cs.CL cs.SD Updated version

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

读你所听：基于声学差异的无参考假设评估

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China

AI summary: Proposes READ, a metric that uses a pretrained autoregressive TTS model to compute the acoustic discrepancy between speech and a text hypothesis, evaluating ASR hypotheses without reference transcripts and achieving up to 20% relative error rate reduction in noisy conditions.

Comments Submitted to Interspeech 2026. 6 pages, 4 figures

2606.04584 2026-06-04 cs.SD Updated version

SHB-AE: Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array

SHB-AE：基于球谐波束形成的智能手机麦克风阵列Ambisonics编码与升级方法

Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv

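Affiliations: Peking University; Beijing Xiaomi Mobile Software Co., Ltd; Xiaomi Communications Co., Ltd

AI summary: For smartphone microphone arrays, proposes SHB-AE, a spherical harmonic beamforming based Ambisonics encoding and up-scaling method; by designing beamformers for each order of spherical harmonic functions, it encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.

Comments Accepted for presentation at AES Europe 2025 Convention (AES 158th Convention), Warsaw, Poland, May 22-24, 2025

AI Chinese abstract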

随着虚拟现实（VR）和增强现实（AR）的快速发展，空间音频录制与回放引起了越来越多的研究兴趣。高阶Ambisonics（HOA）因其对各种播放设备的适应性以及整合头部朝向的能力而脱颖而出。然而，当前的HOA录制通常依赖于笨重的球形麦克风阵列（SMA），而智能手机等便携设备受到阵列配置和麦克风数量的限制。我们提出SHB-AE，一种基于球谐波束形成的Ambisonics编码方法，适用于智能手机麦克风阵列（SPMA）。通过基于阵列流形为各阶球谐函数设计波束形成器，该方法实现了Ambisonics编码与升级。在真实SPMA及其模拟自由场对应物上，在噪声和混响条件下的验证表明，该方法仅用四个非规则排列的麦克风即可成功编码并升级至四阶Ambisonics。

English abstract

With the rapid development of virtual reality (VR) and augmented reality (AR), spatial audio recording and reproduction have gained increasing research interest. Higher Order Ambisonics (HOA) stands out for its adaptability to various playback devices and its ability to integrate head orientation. However, current HOA recordings often rely on bulky spherical microphone arrays (SMA), and portable devices like smartphones are limited by array configuration and number of microphones. We propose SHB-AE, a spherical harmonic beamforming based method for Ambisonics encoding using a smartphone microphone array (SPMA). By designing beamformers for each order of spherical harmonic functions based on the array manifold, the method enables Ambisonics encoding and up-scaling. Validation on a real SPMA and its simulated free-field counterpart in noisy and reverberant conditions showed that the method successfully encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.

2606.04570 2026-06-04 cs.SD Updated version

Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching

Flow-HOA：基于流匹配的Ambisonics编码生成式联合优化

Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv

Affiliations: State Key Laboratory of General Artificial Intelligence; School of Intelligence Science and Technology; Peking University; Beijing Xiaomi Mobile Software Co., Ltd; Xiaomi Communications Co., Ltd

AI summary: Proposes Flow-HOA, a generative framework that uses conditional flow matching to jointly optimize time-domain, spectral, and spatial fidelity while producing a deployable bank of FIR encoding filters; it outperforms strong model-based baselines on synthetic data and is preferred in listening tests on real recordings.

Comments Accepted for presentation at AES Europe 2026 Convention (AES 160th Convention), Copenhagen, Denmark, May 28-30, 2026

AI Chinese abstract

从稀疏、不规则的麦克风阵列进行高阶Ambisonics（HOA）编码仍然是沉浸式通信和XR中消费级空间音频捕获的关键挑战。我们提出Flow-HOA，一个生成式框架，联合优化包含时域、频谱和空间保真度的多维目标，同时生成可部署的、时不变的有限脉冲响应（FIR）编码滤波器组。通过条件流匹配，模型学习将简单先验分布映射到FIR滤波器系数的目标分布。训练由复合损失引导，平衡时域波形保真度、多分辨率频谱一致性、子带能量保持和空间指向性约束。在合成模拟数据上的客观评估表明，在信号保真度和空间准确性指标上均优于强模型基线。在真实麦克风阵列录音上的主观听音测试进一步证实，Flow-HOA能产生更高的整体音质并减少伪影，展示了从合成训练数据到真实捕获条件的泛化能力。

English abstract

Higher-Order Ambisonics (HOA) encoding from sparse, irregular microphone arrays remains a critical challenge for consumer spatial audio capture in immersive communication and XR. We propose Flow-HOA, a generative framework that jointly optimizes a multi-dimensional objective encompassing time-domain, spectral, and spatial fidelity while producing a deployable, time-invariant bank of Finite Impulse Response (FIR) encoding filters. Using conditional flow matching, the model learns to map a simple prior distribution to the target distribution of FIR filter coefficients. Training is guided by a composite loss that balances time-domain waveform fidelity, multi-resolution spectral consistency, sub-band energy preservation, and spatial directivity constraints. Objective evaluations on synthetically simulated data demonstrate improved performance over strong model-based baselines in both signal fidelity and spatial accuracy metrics. Subjective listening tests on real microphone array recordings further confirm that Flow-HOA yields higher overall sound quality with reduced artifacts, demonstrating generalization from synthetic training data to real-world capture conditions.

2606.04475 2026-06-04 cs.SD cs.MM math.SP Updated version

A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study

笔记本电脑扬声器再现的接触振动声音的二阶倒谱特征：一个合成案例研究

Jim Salsman

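Affiliations: TalkNicer, Inc.

AI summary: Through a synthetic signal-chain analysis, proposes that contact-vibration sounds reproduced by laptop loudspeakers have first-order and second-order cepstral periodic structure, with second-order cepstral bimodality most evident at the mechanical source and at laptop-speaker playback.

Comments 11 pages, 4 tables, 5 figures, 8 references

AI Chinese abstract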

手机在硬表面上振动时，通过笔记本电脑扬声器再现的声音通常在质量上不同于普通的视听录音。我们提出这种感知独特性的部分原因可以描述为嵌套周期性：一阶倒谱结构反映振动周期及其倍数，二阶倒谱结构反映一阶倒谱内的重复间隔。将感知效应视为真实的，并使用刻意透明的合成信号链，我们建模了六个阶段：机械生成、表面和空气传播、麦克风捕获、编码和解码、笔记本电脑扬声器播放以及重新录制或后处理。合成分析表明，一阶倒谱周期性在整个链中得以保留，而更干净的双峰或准双峰二阶倒谱特征在机械源和笔记本电脑扬声器播放时最为明显。该结果支持但未证明以下假设：笔记本电脑再现可以重新强调潜在的接触振动周期性，而这种周期性在中间记录和编码形式中表达得不够清晰。我们将二阶倒谱双峰性视为接触振动播放的探索性描述符，而非完整的感知度量。所需的验证包括真实设备的录音、受控的播放传递函数、感知判断以及与普通语音、音乐和环境录音的比较。

English abstract

A mobile phone vibrating on a hard surface often sounds qualitatively unlike ordinary audiovisual recordings when reproduced through laptop loudspeakers. We propose that part of this perceptual distinctiveness can be described as a nested periodicity: a first-order cepstral structure reflecting the vibration period and its multiples, and a second-order cepstral structure reflecting repeated spacing within the first-order cepstrum. Treating the perceptual effect as real and using a deliberately transparent synthetic signal chain, we model six stages: mechanical generation, surface and air propagation, microphone capture, encoding and decoding, laptop-speaker playback, and re-recording or post-processing. The synthetic analysis shows that the first-order cepstral periodicity is preserved across the chain, whereas a cleaner bimodal or quasi-bimodal second-order cepstral signature is most evident at the mechanical source and at laptop-speaker playback. The result supports, but does not prove, the hypothesis that laptop reproduction can re-emphasize a latent contact-vibration periodicity that is less cleanly expressed in intermediate recorded and encoded forms. We frame second-order cepstral bimodality as an exploratory descriptor of contact-vibration playback rather than as a completed perceptual metric. Required validation includes recordings of real devices, controlled playback transfer functions, perceptual judgments, and comparisons against ordinary speech, music, and environmental recordings.
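
The nested periodicity described above can be illustrated with a small numerical sketch. The first-order real cepstrum is standard; treating the "second-order cepstrum" as the real cepstrum of a mean-removed slice of the first-order cepstrum is an assumption made here for illustration, since the abstract does not give the exact definition. The function names and the `max_quefrency` parameter are hypothetical.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    x = np.asarray(x, dtype=float)
    log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=len(x))

def second_order_cepstrum(x, max_quefrency):
    """Assumed definition: cepstrum of the first-order cepstrum (quefrency 1..max-1)."""
    c1 = real_cepstrum(x)[1:max_quefrency]
    return real_cepstrum(c1 - c1.mean())

# Synthetic vibration-like signal: an impulse train with a period of 100 samples.
signal = np.zeros(1000)
signal[::100] = 1.0
c1 = real_cepstrum(signal)
print("first-order peak quefrency:", int(np.argmax(c1[1:150])) + 1)
c2 = second_order_cepstrum(signal, max_quefrency=500)
print("second-order cepstrum length:", len(c2))
```

For the impulse train, the first-order cepstrum peaks at the vibration period and its multiples, which is the regular spacing a second-order analysis would then pick up.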

2606.04418 2026-06-04 cs.SD cs.CL eess.AS Updated version

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

CleanCodec：通过感知引导编码实现高效且鲁棒的语音分词化

Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

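Affiliations: Pennsylvania State University; Drexel University

AI summary: Proposes CleanCodec, a denoising audio codec that uses selective information-bottleneck encoding to keep only perceptually important features; at 12.5 tokens per second it achieves state-of-the-art tokenization efficiency, substantially outperforms existing codecs in speaker similarity and speech intelligibility, and gives up to 17x faster inference on downstream tasks.

AI Chinese abstract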

神经音频编解码器是语音处理流程的关键组件，将音频压缩为离散令牌以供下游建模。然而，现有编解码器难以平衡重建质量与令牌效率，常常以牺牲语言和声学有意义内容为代价，编码背景噪声和录音伪影等感知无关信息。我们将音频分词化重新定义为选择性信息瓶颈问题，并提出CleanCodec，一种去噪音频编解码器，学习仅编码感知重要特征并丢弃不可感知信息。在每秒仅12.5个令牌的情况下，CleanCodec实现了最先进的分词效率，在说话人相似度和语音可懂度上大幅优于现有编解码器。在下游文本到语音和语音转换任务上的评估进一步展示了改进的性能和高达17倍的推理加速，凸显了显著的效率提升。

English abstract

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

2606.04370 2026-06-04 eess.AS cs.SD eess.SP Updated version

Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction

掩蔽小波散射变换神经场用于声场重建

Xinmeng Luan, Samuel A. Verburg, Efren Fernandez-Grande, Gary Scavone

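Affiliations: Fonds de recherche du Québec – Nature et technologies

AI summary: Proposes a method for sound field reconstruction from sparse observations that uses the wavelet scattering transform as a multi-scale feature extractor, combined with neural field optimization and a masking strategy, and validates it on HRTF upsampling.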

Comments 5 pages, 2 figures, conference

2606.04358 2026-06-04 cs.SD eess.AS math.CO Updated version

Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses

基于几何卷积的高斯圆格点用于合成高维图像源房间脉冲响应

Yuancheng Luo

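Affiliations: University of California, Berkeley

AI summary: Proposes reducing lattice point counting in the image-source model to the classic Gauss circle problem, lowering the compute bound from O(k^N) to O(N k^2 log k) via geometric convolutions, and extends the model to frequency-dependent and reflection-weighted image-sources in higher dimensions.

Comments Accepted for publication at the 29th International Conference on Digital Audio Effects 2026

AI Chinese abstract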

图像源模型（ISM）是一种广泛采用的方法，用于在镜面反射假设下高效模拟声学房间脉冲响应（RIR）。源和接收器之间的声学路径被追踪到从房间边界平面的连续反射计算出的格点。矩形房间将图像源的总数限制为RIR持续时间或等效距离k的多项式，次数等于房间维度数N。因此，直接ISM模拟的计算上界为O(k^N)，并且为了可处理性和实际应用，仅考虑N≤3的情况。本文提出了一种替代计算方法，通过将ISM格点计数简化为经典高斯圆问题（GCP），将整数坐标和房间维度的渐近计算界降低到O(N k^2 log k)。我们将格点计数模型扩展到更高维度的频率依赖和反射加权图像源，通过卷积算子关联连续维度之间的解。给出了两种实现RIR的构造方法，以及时频控制、误差和运行时间分析以及RIR统计量。

English abstract

The image-source model (ISM) is a widely adopted method for efficiently simulating acoustic room impulse responses (RIRs) under specular reflection assumptions. Acoustic paths between source and receiver are traced to lattice points computed from successive reflections over bounding planes of the room. Rectangular rooms bound the total number of image-sources to be polynomial in the RIR's duration, or equivalently in the distance $k$, with degree equal to the number of room dimensions $N$. Direct ISM simulations are therefore upper-bounded in compute by $O(k^N)$, and consider only cases of $N \leq 3$ for tractability and real-world applications. This work proposes an alternative computational method that lowers the asymptotic compute bound to $O(N k^2 \log k)$ for integer coordinates and room dimensions by reducing ISM lattice point counting to the classic Gauss circle problem (GCP). We extend the lattice counting model to frequency-dependent and reflection-weighted image-sources in higher dimensions, relating solutions between successive dimensions via the convolution operator. Two constructions for realizing RIRs are presented, along with time-frequency controls, error and run-time analysis, and RIR statistics.
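
The idea of relating lattice counts in successive dimensions by convolution can be shown on the simplest case, the unit integer lattice. The sketch below counts points of Z^N by squared distance from the origin: the 1-D count is convolved with itself once per added dimension. It is an illustration of the classical counting identity only, under the assumption of a unit lattice with no room geometry, reflection weights, or frequency dependence; the function names are hypothetical, and `np.convolve` is a direct convolution, so an FFT-based convolution would be needed to approach the stated $O(N k^2 \log k)$ bound.

```python
import numpy as np

def squares_indicator(max_sq):
    """r1[m] = number of integers x with x*x == m, for m = 0..max_sq."""
    r1 = np.zeros(max_sq + 1, dtype=np.int64)
    k = 0
    while k * k <= max_sq:
        r1[k * k] = 1 if k == 0 else 2  # +k and -k both contribute
        k += 1
    return r1

def lattice_counts(n_dims, max_sq):
    """r[m] = number of points in Z^n_dims at squared distance m from the origin."""
    r1 = squares_indicator(max_sq)
    r = r1.copy()
    for _ in range(n_dims - 1):
        # Adding one dimension is a convolution over squared distance.
        r = np.convolve(r, r1)[: max_sq + 1]
    return r

# Gauss circle check: lattice points with x^2 + y^2 <= 25.
counts_2d = lattice_counts(2, 25)
print(int(counts_2d.sum()))
```

Summing the 2-D counts up to squared radius $k^2$ gives the Gauss circle count, and the array length grows as $k^2$, which is where the $k^2$ factor in the bound comes from.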

2606.04221 2026-06-04 cs.SD cs.AR eess.AS Updated version

Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid

基于时域DNN的助听器嵌入式FPGA语音增强可行性研究

Feyisayo Olalere, Umut Altin, Kiki van der Heijden, Marcel van Gerven

Affiliations: Radboud University, Donders Institute for Brain, Cognition, and Behaviour, The Netherlands; Mortimer B. Zuckerman Mind, Brain, Behavior Institute, Columbia University, USA

AI summary: Deploys a lightweight SuDoRM-RF++ model on an AMD-Xilinx Kria KV260 and evaluates speech separation and denoising at FP32 and 16-bit fixed-point precision; data movement is the main bottleneck, and the fixed-point denoising accelerator reaches 9.7 ms first-sample latency, within the 10 ms clinical threshold.

Comments 13 pages

2606.04210 2026-06-04 eess.AS cs.LG cs.SD Updated version

Representation Matters in Randomized Smoothing for Audio Classification

表示在音频分类的随机平滑中至关重要

Jong-Ik Park, Shreyas Chaudhari, José M. F. Moura, Carlee Joe-Wong

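Affiliations: University of California, Berkeley

AI summary: Studies how the choice of representation affects randomized smoothing for audio classification, showing experimentally that preprocessing and representation choices change certified robustness, and proposes reporting guidelines.

AI Chinese abstract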

随机平滑（RS）在添加高斯噪声的向量空间中认证鲁棒性。在音频分类中，该空间通常不是唯一确定的，因为标准流程会对波形进行归一化、范围控制，并将其转换为log-mel或其他频谱特征。我们表明，除非认证对象和预处理策略明确，否则直接RS是欠定义的。在两个音频基准（关键词识别和环境声音分类）上，我们研究了波形、特征空间和后处理平滑。我们的诊断显示了为什么表示感知的报告是必要的：在相同的平滑水平$σ=0.0025$下，两个数据集共享相同的中位数原始半径$.007996$，但不同的波形能量产生不同的SNR等效尺度（$83.98$ vs. $90.97$ dB）；log-mel平滑在环境声音上给出更高的正半径认证准确率（$68.42\%$ vs. $65.53\%$），认证了更多具有非零半径的样本，但基于特征而非波形；裁剪或峰值归一化将有效扰动范数改变约$230$--$351\times$。因此，我们建议音频RS研究选择并报告任务特定的认证对象和扰动模型，包括扰动位置、增益策略、原始半径以及任何噪声后的几何变化。

English abstract

Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined as standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary: at the same smoothing level $\sigma=0.0025$, the two datasets share the same median raw radius $0.007996$, but different waveform energies yield different SNR-equivalent scales ($83.98$ vs. $90.97$ dB); log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds ($68.42\%$ vs. $65.53\%$), certifying more examples with nonzero radius but over features rather than waveforms; and clipping or peak normalization changes the effective perturbation norm by roughly $230$--$351\times$. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
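
For readers unfamiliar with how a raw radius is obtained, the sketch below uses the commonly cited simplified L2 certificate for Gaussian randomized smoothing, radius = sigma * Phi^{-1}(p_lower), where p_lower is a lower confidence bound on the top-class probability under noise. Whether the paper uses exactly this form is an assumption; the function name `certified_radius` is hypothetical. The radius is measured in whatever space the noise is added (waveform or features), which is the paper's central point.

```python
from statistics import NormalDist

def certified_radius(sigma, p_lower):
    """Simplified Gaussian-smoothing L2 radius (assumed form, see lead-in).

    sigma: standard deviation of the added Gaussian noise.
    p_lower: lower bound on the top-class probability, strictly below 1.
    Returns 0.0 (abstain) when the bound does not exceed one half.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

# The radius scales linearly with sigma for a fixed probability bound.
print(certified_radius(0.0025, 0.99))
print(certified_radius(0.0050, 0.99))
```

Under this form, the reported median raw radius of 0.007996 at sigma = 0.0025 corresponds to a ratio radius/sigma of about 3.2, independent of the dataset, which is consistent with the abstract's observation that the raw radius alone does not distinguish the two benchmarks.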

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD Updated version

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

DetectZoo：一个用于跨文本、音频和图像模态的AI生成内容检测的统一工具包

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

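Affiliations: University of Toronto; University of Waterloo; Toronto Metropolitan University; University of British Columbia; Vector Institute

AI summary: Introduces DetectZoo, a first-of-its-kind unified toolkit for AI-generated content detection across text, audio, and image modalities; it standardizes data preprocessing and evaluation and integrates 61 detectors and 22 benchmark datasets for fair, reproducible benchmarking.

AI Chinese abstract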

生成模型的日益普及和能力提升模糊了人类与机器生成内容之间的界限，推动了跨文本、图像和音频检测领域的大量研究。大多数现有的检测器要么是商业软件，要么是开源但带有不兼容的代码库、定制化的预处理、评估协议和评估指标，这使得它们的采用、公平比较和复现变得相当困难。为了解决这一关键差距，我们引入了DetectZoo，这是首个可扩展的工具包，旨在为跨文本、音频和图像模态的AI生成内容检测提供统一接口。DetectZoo标准化了从数据摄取和预处理到模型评估的完整实证流程，为研究人员提供了一个统一的框架来系统地基准测试最先进的检测器。通过将多样的公共数据集和基线检测算法集成到单一的统一API下，我们的工具包促进了严格且可重复的评估。DetectZoo提供了61个检测器的参考实现、22个基准数据集的原生加载器，以及一个标准化的评估流程，通过通用接口报告多个指标。每个检测器都是自包含的，但可通过同一接口访问，自动缓存预训练权重，并复现原始发表的结果。DetectZoo降低了多模态AI取证的入门门槛，使研究人员能够识别跨领域的性能差距，并加速开发鲁棒、可泛化的检测技术。开源仓库和全面文档可在https://github.com/sadjadeb/DetectZoo 获取，且可通过pip install detectzoo安装该包。

English abstract

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04103 2026-06-04 cs.SD cs.AI cs.LG eess.AS Updated version

The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids

可微分听觉环路（DAL）：用于超个性化助听器的机器学习框架

Alejandro Ballesta Rosen, Jason Mikiel-Hunter, Julian Maclaren, Jack Collins, Richard F. Lyon, Simon Carlile

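Affiliations: Google Research Australia; Macquarie University

AI summary: Proposes the Differentiable Auditory Loop (DAL) framework, which ports the CARFAC model to JAX and optimizes a SEANet deep neural network to compensate for hearing loss using normal-hearing neural activity patterns as the reference; it outperforms the tested master hearing aid (MHA) baselines on neural-representation and signal-fidelity metrics.

AI Chinese abstract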

传统助听器依赖固定的频率依赖性放大和压缩来管理灵敏度降低，这在复杂环境中（如多说话者场景，即“鸡尾酒会”问题）往往无法提供足够的听力支持。为了更全面地解决听力损失背后的编码功能障碍，我们引入了可微分听觉环路（DAL），这是一个用于个性化助听器设计和验配的新开源框架。我们的第一个DAL实现包含了CARFAC——一个可微的人类耳蜗功能模型，我们将其移植到JAX，以优化深度神经网络，使受损的听觉神经活动模式与正常听力参考匹配。为了构建具有所需精细频谱-时间信号处理的助听器，我们采用了SEANet，一种波形到波形的全卷积UNet生成器。我们通过比较适配正常听力的CARFAC模型输出与适配每个受试者个体听力损伤的CARFAC模型输出来微调网络。比较使用来自各自CARFAC神经活动模式（NAP）输出和稳定听觉图像（SAI）的损失函数进行，后者提供捕获听觉神经输出中相位不敏感时间结构的二维表示。通过梯度下降，SEANet模型学习同时去噪输入并补偿由受损CARFAC模型建模的听力损失。在神经表征和信号保真度指标上，DAL优化的SEANet模型优于测试的主助听器（MHA）基线。DAL框架为基于模型、机器学习驱动的助听器信号处理个性化提供了一条实用路径。下一步包括硬件部署以实现真实世界的临床测试。

English abstract

Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails to provide sufficient listening support in complex environments, such as situations with multiple speakers (the "cocktail party" problem). To more comprehensively address the underlying encoding dysfunctions of hearing loss, we introduce the Differentiable Auditory Loop (DAL), a new open-source framework for personalized hearing aid design and fitting. Our first implementation of DAL incorporates CARFAC, a differentiable model of human cochlear function, which we ported to JAX, to optimize a deep neural network to match impaired auditory neural activity patterns with a normal-hearing reference. To build a hearing aid with the fine-grained spectro-temporal signal processing required, we adopt SEANet, a waveform-to-waveform fully convolutional UNet generator. We fine-tune the network by comparing the outputs of a CARFAC model fitted to normal hearing with those of a CARFAC model fitted to match each subject's individual hearing impairment. The comparison is done using loss functions derived from the respective CARFAC neural activity pattern (NAP) outputs and stabilized auditory images (SAIs), the latter providing a 2D representation that captures phase-insensitive temporal structure in the auditory nerve output. Through gradient descent, the SEANet model learns to both denoise the input and compensate for the hearing loss modelled by the impaired CARFAC model. Across neural-representation and signal-fidelity metrics, the DAL-optimized SEANet model outperformed the tested master hearing aid (MHA) baselines. The DAL framework provides a practical path toward model-based, machine-learning-driven personalization of hearing aid signal processing. Next steps include hardware deployment to enable real-world clinical testing.

2606.04040 2026-06-04 cs.SD cs.AI eess.AS Updated version

Channel-Oriented Design for EEG-to-Music Reconstruction

面向脑电到音乐重建的通道导向设计

Jiaxin Qing, Junwei Lu, Lexin Li

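Affiliations: UC Berkeley; Harvard University

AI summary: To address EEG signals that are weak and susceptible to noise and channel variability, proposes a channel-oriented design (channel-wise tokenization, multi-view self-distillation, and data augmentation) that enables stable alignment to a semantic music representation space within an encoding-alignment-decoding pipeline, with consistent performance gains over state-of-the-art baselines.

AI Chinese abstract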

脑机接口旨在从神经信号中解码自然刺激，但迄今为止大多数进展集中在视觉和语言领域。本文研究更具挑战性但探索较少的脑电到音乐重建场景，其中信号微弱、分布广泛且极易受噪声和通道变异影响。我们的核心发现是，早期通道混合会破坏微弱但具有判别性的脑电信号。为此，我们提出一种包含三个关键组件的通道导向设计。具体而言，通道级标记化将每个电极视为显式标记以保留空间局部的神经证据，通道级多视角自蒸馏通过时间裁剪和随机通道子集强制一致性以学习鲁棒且分布式的表示，通道级数据增强引入结构化通道丢弃以提高对噪声、伪迹和缺失电极的不变性。这些组件共同保留了跨通道的微弱但信息丰富的信号，并实现了与语义音乐表示空间的稳定对齐。我们将该通道导向设计集成到脑电到音乐重建的编码-对齐-解码流水线中。理论上，我们刻画了何时保留通道级结构能够改善对齐。实验上，我们与一系列最先进的基线方法进行比较，并展示了一致且显著的性能提升。

英文摘要

Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.
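
The structured channel dropout mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `channel_dropout`, the `drop_prob` parameter, and the rule that at least one channel always survives are assumptions made for illustration. Each electrode is treated as a unit, so a dropped channel is zeroed over the whole time axis rather than sample by sample.

```python
import random


def channel_dropout(eeg, drop_prob=0.2, rng=None):
    """Zero out whole EEG channels at random (hypothetical helper).

    eeg: list of channels, each a list of samples (channels x time).
    Returns (augmented_copy, keep_mask). At least one channel is always kept.
    """
    rng = rng or random.Random()
    n_channels = len(eeg)
    # Decide per channel, not per sample, so the dropout is structured.
    keep = [rng.random() >= drop_prob for _ in range(n_channels)]
    if not any(keep):
        # Assumption: never hand the encoder an all-zero recording.
        keep[rng.randrange(n_channels)] = True
    augmented = [list(ch) if k else [0.0] * len(ch) for ch, k in zip(eeg, keep)]
    return augmented, keep
```

The returned mask could also be reused to form the random channel subsets that the multi-view self-distillation step compares, though how the paper builds its views is not stated in the abstract.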

2606.03283 2026-06-04 eess.AS cs.SD 版本更新

SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification

SpeakerCard-1M：面向野外说话人确认的基于证据的说话人卡片语料库

Junyi Peng, Oldřich Plchot, Xiao Song, Dading Chong, Lichun Fan, Hang Su, Themos Stafylakis, Junjie Li, Kong Aik Lee, Shuai Wang, Jian Luan, Jan Černocký

发表机构 * Brno University of Technology, Czechia（布尔诺理工大学，捷克）； Peking University, China（北京大学，中国）； Xiaomi, China（小米，中国）； Athens University of Economics and Business, Greece（雅典经济与商业大学，希腊）； The Hong Kong Polytechnic University, Hong Kong（香港理工大学，香港）； Nanjing University, China（南京大学，中国）

AI总结提出SpeakerCard-1M双语说话人资源，通过声学探针和受限LLM生成结构化说话人卡片，并定义跨模态协议以支持基于证据的说话人确认。

Comments Corpus and protocols at https://junyipeng00.github.io/SpeakerCard-1M-page

AI中文摘要

现代说话人确认（SV）系统依赖于说话人嵌入，这些嵌入有效但难以解释或通过自然语言查询。大多数现有的语音-文本语料库针对可控合成或话语级字幕，并为野外说话人识别提供有限的说话人级监督。本文介绍了SpeakerCard-1M，一个面向基于证据的SV的双语说话人中心资源，源自VoxCeleb1/2和CN-Celeb1/2，其中“-1M”后缀指发布中包含的178万条话语级字幕。我们采用工具优先、LLM最后的策略：十个声学探针产生字段级证据，证据在将相对稳定特征与话语级状态分离的模式下聚合成说话人档案，并由仅看到结构化字段的受限LLM渲染出双语说话人卡片。发布内容包括10200个说话人的56700条说话人卡片记录、178万条话语级字幕以及说话人ID不相交的难负三元组。我们进一步定义了两个面向SV的跨模态协议：双向说话人-文本检索（T2S-R / S2T-R）和属性条件验证（AC-Verify），并在零样本强制选择设置下将双编码器基线与最近的音频语言模型进行比较。联合音频-文本训练使VoxCeleb1-O的EER比纯音频基线绝对提高0.31%。在风格对称的LLM生成反事实协议下，八个最近的音频语言模型（7B-30B+参数，包括开源和闭源）在双向强制选择的音高级AC-Verify中得分为49-77%，而我们的双编码器达到88.66%。

英文摘要

Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the "-1M" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.
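
For readers unfamiliar with the equal error rate (EER) quoted above, the following is a minimal sketch of how it is commonly approximated from verification scores. The function name `equal_error_rate` and the threshold sweep over observed scores are illustrative choices, not the evaluation code of the paper. The EER is the operating point where the false-accept rate equals the false-reject rate.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    """
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False accepts: impostor trials scoring at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        # False rejects: genuine trials scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With few trials the two rates rarely cross exactly, so the sketch returns their average at the closest threshold. Standard toolkits interpolate instead.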

2606.01804 2026-06-04 eess.AS cs.SD 版本更新

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

SpeechEditBench：面向指令引导语音编辑的双语多属性基准

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Linqi Song

发表机构 * Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； AI Lab, Leibniz Research Center, Huawei（华为莱布尼茨研究中心人工智能实验室）

AI总结提出SpeechEditBench双语多属性基准，通过锚点评估协议衡量语音编辑中目标属性的修改成功与非目标属性的保持，发现现有模型在组合编辑任务上表现不佳。

AI中文摘要

指令引导的语音编辑要求模型在修改指定语音属性的同时保持无关特征不变。尽管语音大语言模型（Speech LLMs）进展迅速，但对该能力的系统评估仍具挑战，因为现有基准分散于孤立的编辑任务。为弥补这一差距，我们引入了SpeechEditBench，一个用于指令引导语音编辑的双语多属性基准。SpeechEditBench包含七个原子编辑任务，以及将多个操作整合到单条指令中的组合编辑任务。我们提出了一种基于锚点的评估协议，分别评估目标属性的编辑成功和非目标属性的保持，从而得出三个指标：目标成功、保持成功和联合成功。利用该基准，我们评估了主流的Speech LLMs和专门的语音编辑系统。结果揭示了三个关键发现：（1）没有单一模型在所有编辑维度上表现良好；（2）闭源Speech LLMs通常优于开源模型；（3）组合编辑仍然极具挑战，即使是最先进的模型也难以实现高联合成功。SpeechEditBench提供了一个严格的诊断框架来识别Speech LLMs的瓶颈，从而促进具有更稳健和精确指令引导编辑能力的下一代Speech LLMs的开发。数据和代码见 https://github.com/daxintan-cuhk/SpeechEditBench 。

英文摘要

Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code are available at https://github.com/daxintan-cuhk/SpeechEditBench .

2603.09391 2026-06-04 cs.SD cs.AI eess.AS 版本更新

Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis

基于物理信息的神经引擎声音建模与可微分脉冲串合成

Robin Doerfler, Lonce Wyse

发表机构 * GitHub

AI总结提出脉冲串-谐振器（PTR）模型，通过可微分合成架构直接建模发动机脉冲形状和时间结构，利用物理信息归纳偏置提升谐波重建质量并降低总损失。

Comments Revised version; to appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026)

AI中文摘要

发动机声音源自连续的排气压力脉冲，而非持续的谐波振荡。虽然神经合成方法通常旨在近似最终的频谱特性，但我们提出直接建模底层脉冲形状和时间结构。我们提出了脉冲串-谐振器（PTR）模型，这是一种可微分合成架构，通过将发动机音频生成为与发动机点火模式对齐的参数化脉冲串，并通过模拟排气声学的递归Karplus-Strong谐振器传播它们。该架构集成了物理信息归纳偏置，包括谐波衰减、热力学音高调制、气门动力学包络、排气系统共振以及导出的发动机运行模式，如节气门操作和减速断油（DFCO）。在三种不同发动机类型（总计7.5小时音频）上验证，PTR在谐波重建上比谐波加噪声基线模型提高了21%，总损失降低了5.7%，同时提供了对应于物理现象的可解释参数。完整的代码、模型权重和音频示例已公开提供。

英文摘要

Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and Deceleration Fuel Cutoff (DFCO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
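
The pulse-train-into-resonator idea can be sketched in a few lines. This is a non-differentiable toy under stated assumptions, not the PTR model itself: `pulse_train` uses unit impulses in place of the paper's parameterized pulse shapes, and `karplus_strong` is the textbook recursive comb filter with a two-point average in the feedback path. The `delay` and `decay` values are hypothetical.

```python
def pulse_train(num_samples, period):
    """Unit impulses every `period` samples (a stand-in for firing events)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def karplus_strong(excitation, delay, decay=0.9):
    """Recursive comb filter with an averaging lowpass in the loop:

    y[n] = x[n] + decay * 0.5 * (y[n - delay] + y[n - delay - 1])
    """
    y = [0.0] * len(excitation)
    for n, x in enumerate(excitation):
        fb1 = y[n - delay] if n >= delay else 0.0
        fb2 = y[n - delay - 1] if n >= delay + 1 else 0.0
        y[n] = x + decay * 0.5 * (fb1 + fb2)
    return y


# Example: pulses every 40 samples ringing through a 25-sample resonator.
exhaust = karplus_strong(pulse_train(400, 40), delay=25, decay=0.9)
```

Keeping `decay` below 1 keeps the loop stable, and the averaging term damps high frequencies faster than low ones, which is what gives each pulse a decaying resonant tail.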

2603.07584 2026-06-04 cs.SD cs.LG eess.AS 版本更新

Analysis-Driven Procedural Generation of an Engine Sound Dataset with Embedded Control Annotations

分析驱动的发动机声音数据集程序化生成与嵌入式控制注释

Robin Doerfler, Lonce Wyse

发表机构 * rdoerfler

AI总结提出一种分析驱动的框架，通过自适应音高谱分析提取真实录音中的谐波结构，驱动扩展参数化谐波加噪声合成器，生成带有精确时间对齐控制注释的发动机音频数据集，用于数据驱动的发动机声音建模。

Comments To appear in the Proceedings of the 34th European Signal Processing Conference (EUSIPCO 2026)

AI中文摘要

计算发动机声音建模是汽车音频行业的核心，尤其适用于主动声音设计应用和虚拟原型设计。新兴的数据驱动发动机声音合成方法需要大量标准化、干净的音频记录，并带有精确时间对齐的运行状态注释：由于高成本、专用测量设备要求和不可避免的噪声污染，这些数据难以获取。我们提出了一种分析驱动的框架，用于生成带有样本精确控制注释的发动机音频。该方法通过自适应音高谱分析从真实录音中提取谐波结构，进而驱动扩展参数化谐波加噪声合成器。利用该框架，我们通过多样化的控制轨迹和参数变化，将每个发动机的5-10分钟源音频扩展15-30倍，生成程序化发动机声音数据集（19.0小时，5,935个文件）：一组带有样本精确RPM和扭矩注释的发动机音频信号，覆盖广泛的工作条件、信号复杂度和谐波轮廓。与真实录音的对比验证了合成数据保留了特征谐波结构，基于该数据集训练的基线可微合成网络证实了其适用于数据驱动的发动机声音建模。该数据集已公开发布，以支持发动机音色分析、控制参数估计和神经生成合成的研究。

英文摘要

Computational engine sound modeling is central to the automotive audio industry, particularly for active sound design applications and virtual prototyping. Emerging data-driven engine sound synthesis methods require large volumes of standardized, clean audio recordings with precisely time-aligned operating-state annotations: data that is difficult to obtain due to high costs, specialized measurement equipment requirements, and inevitable noise contamination. We present an analysis-driven framework for generating engine audio with sample-accurate control annotations. The method extracts harmonic structures from real recordings through pitch-adaptive spectral analysis, which then drive an extended parametric harmonic-plus-noise synthesizer. With this framework, we augment 5-10 min of source audio per engine 15-30x via diverse control trajectories and parametric variation, producing the Procedural Engine Sounds Dataset (19.0 h, 5,935 files): a set of engine audio signals with sample-accurate RPM and torque annotations spanning a wide range of operating conditions, signal complexities, and harmonic profiles. Comparison against real recordings validates that the synthesized data preserves characteristic harmonic structures, and a baseline differentiable synthesis network trained on the dataset confirms its suitability for data-driven engine sound modeling. The dataset is released publicly to support research on engine timbre analysis, control parameter estimation, and neural generative synthesis.
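
To make the notion of sample-accurate control annotations concrete, here is a minimal harmonic-plus-noise sketch driven by a per-sample RPM track. It is an illustration under assumptions rather than the extended synthesizer of the paper: the four-stroke firing-frequency rule, the fixed harmonic amplitudes, and the uniform white noise are all simplifications, and every function and parameter name is hypothetical.

```python
import math
import random


def firing_frequency(rpm, cylinders=4):
    """Four-stroke firing frequency in Hz (assumption: each cylinder fires
    once every two crankshaft revolutions)."""
    return rpm / 60.0 * cylinders / 2.0


def synth_harmonic_plus_noise(rpm_track, sample_rate, harmonic_amps,
                              noise_level=0.01, seed=0):
    """One output sample per RPM annotation, so audio and control stay aligned."""
    rng = random.Random(seed)
    phase, out = 0.0, []
    for rpm in rpm_track:
        # Accumulate phase so a changing RPM gives a smooth frequency glide.
        phase += 2.0 * math.pi * firing_frequency(rpm) / sample_rate
        harmonic = sum(a * math.sin((k + 1) * phase)
                       for k, a in enumerate(harmonic_amps))
        out.append(harmonic + noise_level * rng.uniform(-1.0, 1.0))
    return out
```

Because the RPM track has exactly one value per audio sample, the annotation is aligned by construction, which is the property that is hard to obtain from real recordings.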

2508.08237 2026-06-04 cs.MM cs.AI cs.CV cs.SD eess.AS 版本更新

VGGSounder: Audio-Visual Evaluations for Foundation Models

VGGSounder：基础模型的音视频评估

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

发表机构 * Technical University of Munich, MCML（慕尼黑工业大学，MCML）； University of Tübingen（图宾根大学）； Tübingen AI Center（图宾根人工智能中心）； MPI for Intelligent Systems, ELLIS Institute（马克斯·普朗克智能系统研究所，ELLIS研究所）

AI总结针对VGGSound数据集在音视频基础模型评估中的标签不完整、类别重叠和模态错位等问题，提出重新标注的多标签测试集VGGSounder，并引入模态混淆指标分析模型性能退化。

Comments Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025

2509.21597 2026-06-04 eess.AS cs.CL cs.SD 版本更新

AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors

AUDDT：音频与语音深度伪造检测器的统一基准工具包

Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

发表机构 * MuSAELab（MuSAELab实验室）

AI总结本文提出AUDDT开源基准工具包，通过整合31个数据集并自动化评估预训练检测器，系统分析了深度伪造检测在不同操作类型和录音条件下的泛化能力与性能差异。

AI中文摘要

随着人工智能生成内容（如音频深度伪造）的普及，近期大量工作聚焦于开发深度伪造检测技术。然而，现有基准仅使用少量数据集，使得检测器在真实世界条件下的泛化能力不确定。本文系统回顾了31个现有音频深度伪造数据集，并提出了一个名为AUDDT（https://github.com/MuSAELab/AUDDT）的开源基准测试工具包。该工具包旨在自动化评估预训练检测器在广泛语音和非语音音频数据集上的性能，为用户提供其深度伪造检测器在不同操作类型和录音条件下的优缺点直接反馈。我们首先展示了所开发工具包的使用方法、基准的组成以及不同深度伪造子组的细分。接着，我们强调了AUDDT与现有基准工作的不同之处，即通过大规模、多样化的现代欺骗方法评估以及通过全面的元数据注释进行更丰富的属性级分析。使用一个广泛采用的预训练深度伪造检测器，我们展示了域内和域外检测结果，揭示了在不同条件和音频操作类型下显著的性能差异。最后，我们还分析了这些现有数据集的局限性及其与实际部署场景之间的差距。

英文摘要

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across a wide range of speech and non-speech audio datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors under diverse manipulation types and recording conditions. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, we highlight how AUDDT differs from existing benchmarking efforts by enabling large-scale, diverse evaluation across modern spoofing methods and richer attribute-level analysis through comprehensive metadata annotation. Using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable performance variability across different conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gaps relative to practical deployment scenarios.

2508.14623 2026-06-04 eess.AS cs.AI cs.SD 版本更新

A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

带噪参考下语音分离中尺度不变信失真比的研究

Simon Dahl Jepsen, Mads Græsbøll Christensen, Jesper Rindom Jensen

发表机构 * European Union（欧洲联盟）

AI总结本文研究了在训练参考包含噪声时，使用尺度不变信失真比作为评估和训练目标的影响，提出通过增强参考和混合数据来避免学习噪声参考，实验表明可减少分离语音中的噪声但可能引入伪影。

DOI: 10.1109/ASRU65441.2025.11434756
Journal ref: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, 2025, pp. 1-8

AI中文摘要

本文研究了在监督语音分离中，当训练参考包含噪声时（如事实上的基准WSJ0-2Mix），使用尺度不变信失真比（SI-SDR）作为评估和训练目标的影响。对带噪参考的SI-SDR推导表明，噪声限制了可实现的SI-SDR，或导致分离输出中出现不希望的噪声。为了解决这个问题，提出了一种增强参考并用WHAM!扩充混合数据的方法，旨在训练避免学习噪声参考的模型。使用非侵入式NISQA.v2指标评估了在这些增强数据集上训练的两个模型。结果显示分离语音中的噪声减少，但表明处理参考可能引入伪影，限制了整体质量提升。在WSJ0-2Mix和Libri2Mix测试集上，各模型的SI-SDR与感知噪声之间存在负相关，这印证了推导的结论。

英文摘要

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
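
The standard SI-SDR definition can be written out as a short sketch. The function name `si_sdr` is illustrative, and the code assumes equal-length, real-valued signals with a nonzero residual. The estimate is projected onto the reference, and the energy of that projection is compared with the energy of what is left over.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    Assumes equal-length real signals and a nonzero residual.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # part of the estimate explained by it
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10(target_energy / residual_energy)
```

A quick check, which is not the derivation in the paper, shows why noisy references cap the score. If the reference is a clean signal plus noise orthogonal to it, a perfectly clean estimate scores only the signal-to-noise ratio of the reference. For example, a clean signal of energy 4 and reference noise of energy 0.04 give 20 dB, and raising the score further would require reproducing the noise.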

1905.03632 2026-06-04 cs.SD cs.SY eess.AS eess.SY 版本更新

Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates

基于DNN辅助相对传输函数估计的块在线多通道语音增强

Jiri Malek, Zbynek Koldovsky, Marek Bohac

发表机构 * Faculty of Mechatronics, Informatics, and Interdisciplinary Studies, Technical University of Liberec（机电一体化、信息学与跨学科研究学院，利贝雷茨技术大学）

AI总结本文提出了一种基于DNN的块在线多通道语音增强方法，通过估计相对传输函数来实现波束成形，并在动态环境中处理短语音，提升了语音增强的鲁棒性。

Comments 10 pages, 8 figures, 4 tables. Modified version of the article accepted for publication in IET Signal Processing journal. Original results unchanged, additional experiments presented, refined discussion and conclusions

DOI: 10.1049/iet-spr.2019.0304
Journal ref: IET Signal Processing, vol. 14, no. 3, pp. 124-133, May 2020

AI中文摘要

本文解决多通道语音增强中的块在线处理问题。此类处理在说话人移动和/或处理极短语音（如语音助手场景）时至关重要。我们考虑了一个系统的若干变体，该系统在基于DNN的语音活动检测（VAD）辅助下进行波束成形，随后进行后滤波。通过估计麦克风之间的相对传输函数来对准目标说话人。输入信号的每个块独立处理，以使该方法适用于高度动态的环境。由于处理块长度较短，波束成形器所需的统计量估计精度较低。本研究分析了这种不精确性的影响，并将其与将整段录音视为单个块（批量处理）的处理模式进行比较。所提方法在CHiME-4的大型数据集以及另一个包含移动目标说话人的数据集上进行了实验评估。实验分别采用客观和感知标准进行评估（如信号干扰比（SIR）和语音质量感知评估（PESQ））。此外，还评估了基线自动语音识别系统取得的词错误率（WER），其中增强方法作为前端解决方案。结果表明，所提方法对较短的处理块长度具有鲁棒性。即使在250毫秒的块长度下，也能在各项标准和WER上观察到显著改进。

英文摘要

This work addresses the problem of block-online processing for multi-channel speech enhancement. Such processing is vital in scenarios with moving speakers and/or when very short utterances are processed, e.g., in voice assistant scenarios. We consider several variants of a system that performs beamforming supported by DNN-based voice activity detection (VAD) followed by post-filtering. The speaker is targeted through estimating relative transfer functions between microphones. Each block of the input signals is processed independently in order to make the method applicable in highly dynamic environments. Owing to the short length of the processed block, the statistics required by the beamformer are estimated less precisely. The influence of this inaccuracy is studied and compared to the processing regime when recordings are treated as one block (batch processing). The experimental evaluation of the proposed method is performed on large datasets of CHiME-4 and on another dataset featuring moving target speaker. The experiments are evaluated in terms of objective and perceptual criteria (such as signal-to-interference ratio (SIR) or perceptual evaluation of speech quality (PESQ), respectively). Moreover, word error rate (WER) achieved by a baseline automatic speech recognition system is evaluated, for which the enhancement method serves as a front-end solution. The results indicate that the proposed method is robust with respect to short length of the processed block. Significant improvements in terms of the criteria and WER are observed even for the block length of 250 ms.

1609.03213 2026-06-04 cs.SD cs.SY eess.SY 版本更新

Relaxed Binaural LCMV Beamforming

松弛双耳LCMV波束成形

Andreas I. Koutrouvelis, Richard C. Hendriks, Richard Heusdens, Jesper Jensen

发表机构 * Oticon Foundation（奥蒂康基金会）

AI总结提出一种松弛LCMV框架的双耳波束成形方法，在抑制噪声并精确保持目标源双耳线索的同时，通过为每个干扰源设置独立的权衡参数，以预定义精度保持多个干扰源的双耳线索。

DOI: 10.1109/TASLP.2016.2628642
Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 137-152, 2016

AI中文摘要

在本文中，我们提出了一种新的双耳波束成形技术，可视为线性约束最小方差（LCMV）框架的一种松弛。所提方法能够同时实现噪声抑制和目标源双耳线索的精确保持，这与双耳最小方差无失真响应（BMVDR）方法类似。然而，与BMVDR不同，所提方法还能以预定义的精度保持多个干扰源的双耳线索。具体来说，它为每个干扰源使用独立的权衡参数，从而控制噪声抑制与干扰源双耳线索保持之间的权衡。此外，我们提供了一种稳健的权衡参数选择方法，使干扰源双耳线索的保持精度始终优于BMVDR的相应精度。与此前使用严格等式约束的基于LCMV的双耳波束成形方法相比，所提方法通过松弛约束，能够对更多干扰源实现近似的双耳线索保持。

英文摘要

In this paper we propose a new binaural beamforming technique which can be seen as a relaxation of the linearly constrained minimum variance (LCMV) framework. The proposed method can achieve simultaneous noise reduction and exact binaural cue preservation of the target source, similar to the binaural minimum variance distortionless response (BMVDR) method. However, unlike BMVDR, the proposed method is also able to preserve the binaural cues of multiple interferers to a certain predefined accuracy. Specifically, it is able to control the trade-off between noise reduction and binaural cue preservation of the interferers by using a separate trade-off parameter per interferer. Moreover, we provide a robust way of selecting these trade-off parameters in such a way that the preservation accuracy for the binaural cues of the interferers is always better than the corresponding ones of the BMVDR. The relaxation of the constraints in the proposed method achieves approximate binaural cue preservation of more interferers than other previously presented LCMV-based binaural beamforming methods that use strict equality constraints.

1602.05702 2026-06-04 cs.SD cs.SY eess.SY stat.ML 版本更新

EEG-informed attended speaker extraction from recorded speech mixtures with application in neuro-steered hearing prostheses

基于脑电信息从录制的语音混合信号中提取受关注说话人及其在神经导向听力假体中的应用

Simon Van Eyndhoven, Tom Francart, Alexander Bertrand

发表机构 * ESAT Laboratory of KU Leuven（鲁汶大学ESAT实验室）； KU Leuven（鲁汶大学）； Department of Electrical Engineering (ESAT)（电气工程系（ESAT））； Department of Neurosciences（神经科学系）

AI总结本文提出了一种基于EEG的受关注说话人提取方法，利用麦克风阵列记录和EEG记录来实现噪声环境下的说话人分离与去噪，展示了在无干净语音信号的情况下，通过EEG进行的听觉注意力检测的鲁棒性。

Comments This paper is published in IEEE Transactions on Biomedical Engineering (2016) and is under copyright. Please cite this paper as: S. Van Eyndhoven, T. Francart, and A. Bertrand, "EEG-informed attended speaker extraction from recorded speech mixtures with application in neuro-steered hearing prostheses", IEEE Transactions on Biomedical Engineering, vol. 64, no. 5, pp. 1045-1056, 2017

DOI: 10.1109/TBME.2016.2587382
Journal ref: IEEE Transactions on Biomedical Engineering, vol. 64, no. 5, pp. 1045-1056, 2017

AI中文摘要

OBJECTIVE: 我们的目标是在嘈杂的双说话人声学场景中提取受关注的说话人并对其去噪，依靠双耳助听器的麦克风阵列录音，并辅以脑电图（EEG）记录来推断目标说话人。 METHODS: 我们提出了一种模块化处理流程：首先从麦克风录音中提取两个语音包络，然后根据EEG选择受关注的语音包络，最后用该包络引导多通道语音分离与去噪算法。 RESULTS: 在保留受关注语音的同时，实现了对干扰（未受关注）语音和背景噪声的强抑制。此外，基于EEG的听觉注意力检测（AAD）对使用带噪语音信号表现出鲁棒性。 CONCLUSIONS: 结果表明，即使在嘈杂的声学环境中，且无法获得用于执行基于EEG的AAD的干净语音信号，基于AAD从麦克风阵列录音中提取说话人仍然可行且鲁棒。 SIGNIFICANCE: 当前的AAD研究总是假设可以获得干净的语音信号，这限制了其在真实环境中的适用性。我们扩展了这一研究，使其在仅有包含带噪语音混合的麦克风录音时也能检测受关注的说话人。这是实现新型脑机接口以及神经导向听力假体中有效滤波方案的关键要素。本文提供了基于EEG信息的受关注说话人提取与去噪的首个概念验证。

英文摘要

OBJECTIVE: We aim to extract and denoise the attended speaker in a noisy, two-speaker acoustic scenario, relying on microphone array recordings from a binaural hearing aid, which are complemented with electroencephalography (EEG) recordings to infer the speaker of interest. METHODS: In this study, we propose a modular processing flow that first extracts the two speech envelopes from the microphone recordings, then selects the attended speech envelope based on the EEG, and finally uses this envelope to inform a multi-channel speech separation and denoising algorithm. RESULTS: Strong suppression of interfering (unattended) speech and background noise is achieved, while the attended speech is preserved. Furthermore, EEG-based auditory attention detection (AAD) is shown to be robust to the use of noisy speech signals. CONCLUSIONS: Our results show that AAD-based speaker extraction from microphone array recordings is feasible and robust, even in noisy acoustic environments, and without access to the clean speech signals to perform EEG-based AAD. SIGNIFICANCE: Current research on AAD always assumes the availability of the clean speech signals, which limits the applicability in real settings. We have extended this research to detect the attended speaker even when only microphone recordings with noisy speech mixtures are available. This is an enabling ingredient for new brain-computer interfaces and effective filtering schemes in neuro-steered hearing prostheses. Here, we provide a first proof of concept for EEG-informed attended speaker extraction and denoising.

1812.03279 2026-06-04 cs.SD cs.NA eess.AS math.NA 版本更新

Estimates of the Reconstruction Error in Partially Redressed Warped Frames Expansions

部分修正扭曲框架展开中重构误差的估计

Thomas Mejstrik, Gianpaolo Evangelista

发表机构 * University of Vienna（维也纳大学）； MDW, University of Music and Performing Arts Vienna（维也纳音乐与表演艺术大学MDW）

AI总结本文研究了具有紧支撑的频率扭曲分析-合成元素的近似误差，通过几个例子和案例研究，探讨了在在线计算中如何通过有限时间支持的近似来减少重构误差。

Comments 8 pages, 5 figures, 4 tables, conference paper

Journal ref: Proc. of Digital Audio Effect Conf. (DAFx'16). Brno, Czech Republic, September 2016, pp. 9-16

AI中文摘要

在最近的工作中，引入了修正扭曲框架，用于非均匀频率和时间分辨率音频信号的分析和合成。在这些框架中，表示元素的频率带或时间间隔的分配可以通过扭曲映射唯一描述。在时间-频率采样后应用反扭曲可以减少或消除扭曲框架元素在共轭变量中的色散，从而可能通过频率构造具有同步时间对齐的频率扭曲框架。然而，修正过程仅在分析和合成窗口在应用扭曲的域中具有紧支撑时才是精确的。这暗示频率扭曲框架不能在时间域中具有紧支撑。当需要在线计算时，这一性质是不理想的。然而，允许时间支持为有限的近似是可能的，这导致较小的重构误差。在本文中，我们研究了具有紧支撑的频率扭曲分析-合成元素的近似误差，提供了几个例子和案例研究。

英文摘要

In recent work, redressed warped frames have been introduced for the analysis and synthesis of audio signals with non-uniform frequency and time resolutions. In these frames, the allocation of frequency bands or time intervals of the elements of the representation can be uniquely described by means of a warping map. Inverse warping applied after time-frequency sampling provides the key to reduce or eliminate dispersion of the warped frame elements in the conjugate variable, making it possible, e.g., to construct frequency warped frames with synchronous time alignment through frequency. The redressing procedure is however exact only when the analysis and synthesis windows have compact support in the domain where warping is applied. This implies that frequency warped frames cannot have compact support in the time domain. This property is undesirable when online computation is required. Approximations in which the time support is finite are however possible, which lead to small reconstruction errors. In this paper we study the approximation error for compactly supported frequency warped analysis-synthesis elements, providing a few examples and case studies.

1705.03342 2026-06-04 math.NA cs.NA cs.SD 版本更新

On the eigenmodes of periodic orbits for multiple scattering problems in 2D

关于二维多重散射问题中周期轨道的本征模

Daan Huybrechs, Peter Opsomer

发表机构 * Department of Computer Science（计算机科学系）； KU Leuven（鲁汶大学）

AI总结本文研究了二维多重散射问题中周期轨道的本征模，提出了一种基于边界积分方程的渐近方法，通过泰勒展开近似相位，以加速射线追踪方案。

Comments 24 pages, 9 figures and the implementation is available on https://github.com/popsomer/asyBEM/releases

AI中文摘要

波传播和声学散射问题在高频下需要大量计算资源才能精确求解。渐近方法通过显式提取解的振荡特性，有可能使计算成本与频率无关。然而，在存在多个散射障碍物时，高频波模式变得非常复杂。我们考虑涉及多个障碍物的二维亥姆霍兹方程的边界积分方程形式，此前已有针对该问题的射线追踪方案。现有的射线追踪方案分析集中于障碍物子集之间的周期轨道。可以观察到，每个障碍物上的密度在几次迭代后收敛到平衡状态。在本文中，我们以泰勒级数的形式给出了这些平衡密度相位的渐近近似。这些密度代表了周期轨道中的一个完整反射周期。我们首先利用对称性处理两个圆形散射体的情况，同时也为任意数量的一般二维障碍物提供了显式算法。系数及其计算时间均与波数和入射波无关。这些结果可用于在少量初始迭代后加速射线追踪方案。

英文摘要

Wave propagation and acoustic scattering problems require vast computational resources to be solved accurately at high frequencies. Asymptotic methods can make this cost potentially frequency independent by explicitly extracting the oscillatory properties of the solution. However, the high-frequency wave pattern becomes very complicated in the presence of multiple scattering obstacles. We consider a boundary integral equation formulation of the Helmholtz equation in two dimensions involving several obstacles, for which ray tracing schemes have been previously proposed. The existing analysis of ray tracing schemes focuses on periodic orbits between a subset of the obstacles. One observes that the densities on each of the obstacles converge to an equilibrium after a few iterations. In this paper we present an asymptotic approximation of the phases of those densities in equilibrium, in the form of a Taylor series. The densities represent a full cycle of reflections in a periodic orbit. We initially exploit symmetry in the case of two circular scatterers, but also provide an explicit algorithm for an arbitrary number of general 2D obstacles. The coefficients, as well as the time to compute them, are independent of the wavenumber and of the incident wave. The results may be used to accelerate ray tracing schemes after a small number of initial iterations.

1606.09178 2026-06-04 math.NA cs.NA cs.SD 版本更新

High-frequency asymptotic compression of dense BEM matrices for general geometries without ray tracing

无需射线追踪的一般几何形状稠密BEM矩阵高频渐近压缩

Daan Huybrechs, Peter Opsomer

发表机构 * Department of Computer Science（计算机科学系）； KU Leuven（鲁汶大学）

AI总结本文提出了一种基于渐近压缩的BEM矩阵方法，通过显式局部化格林函数来减少计算规模，提高矩阵-向量乘积速度和条件数，适用于复杂几何形状的高频声学问题。

Comments 24 pages, 13 figures

AI中文摘要

声学中的波传播和散射问题通常用边界元方法求解。它们导致一个通常稠密且庞大的离散化矩阵：其规模和条件数随频率升高而增长。然而，高频散射问题本质上是局部的，这可以由高度局部化的射线来回反射很好地表示。渐近方法可以借助射线追踪或类似技术显式提取解的振荡特性，从而减小线性系统的规模，甚至使其与频率无关。然而，在存在（多个）几何形状复杂的散射障碍物时，射线追踪变得昂贵甚至难以处理。在本文中，我们从构造完全解析的大型稠密矩阵的同一离散化出发，转而通过显式局部化格林函数来实现渐近压缩。这得到一个大而稀疏的矩阵，其矩阵-向量乘积更快，并且如数值实验所示，条件数显著改善。尽管格林函数的适当局部化同样依赖于一般几何形状下无法获得的渐近信息，我们可以在从低频到高频的频率扫描中自适应地构造它，且这种方式会自动考虑一般入射波。我们表明，该方法对非凸、多障碍物乃至近陷波域具有鲁棒性，尽管在最后一种情况下压缩率明显较低。此外，尽管具有渐近性质，该方法对应用中常用的低阶离散化（如分段常数、线性或三次）也具有鲁棒性。另一方面，与传统的经典离散化相比，我们并未减少总自由度数。该方法的组合...

英文摘要

Wave propagation and scattering problems in acoustics are often solved with boundary element methods. They lead to a discretization matrix that is typically dense and large: its size and condition number grow with increasing frequency. Yet, high frequency scattering problems are intrinsically local in nature, which is well represented by highly localized rays bouncing around. Asymptotic methods can be used to reduce the size of the linear system, even making it frequency independent, by explicitly extracting the oscillatory properties from the solution using ray tracing or analogous techniques. However, ray tracing becomes expensive or even intractable in the presence of (multiple) scattering obstacles with complicated geometries. In this paper, we start from the same discretization that constructs the fully resolved large and dense matrix, and achieve asymptotic compression by explicitly localizing the Green's function instead. This results in a large but sparse matrix, with a faster associated matrix-vector product and, as numerical experiments indicate, a much improved condition number. Though an appropriate localisation of the Green's function also depends on asymptotic information unavailable for general geometries, we can construct it adaptively in a frequency sweep from small to large frequencies in a way which automatically takes into account a general incident wave. We show that the approach is robust with respect to non-convex, multiple and even near-trapping domains, though the compression rate is clearly lower in the latter case. Furthermore, in spite of its asymptotic nature, the method is robust with respect to low-order discretizations such as piecewise constants, linears or cubics, commonly used in applications. On the other hand, we do not decrease the total number of degrees of freedom compared to a conventional classical discretization. The combination of the ...

1707.03138 2026-06-04 math.NA cs.NA cs.SD 版本更新

Adaptive synchrosqueezing based on a quilted short-time Fourier transform

基于拼接短时傅里叶变换的自适应同步压缩变换

Alexander Berrian, Naoki Saito

AI总结本文提出一种基于拼接短时傅里叶变换的自适应同步压缩变换，通过多窗口适应不同时间频率区域的行为，提升信号分析精度，尤其在噪声环境下表现优异，并应用于动物叫声分析。

Comments 20 pages, 6 figures, submitted to Proceedings of the SPIE Conference on Wavelets and Sparsity XVII (2017)

AI中文摘要

近年来，同步压缩变换（SST）作为一种分析可分解为多个由瞬时幅度和相位确定的信号的方法受到关注。基于短时傅里叶变换（STFT）的一种SST版本能够锐化由STFT得出的瞬时频率（IF）信息，并分离对应不同IF曲线的幅度-相位成分。然而，这种SST受限于基础窗函数的时间-频率分辨率，可能无法准确解析具有多样化时间-频率行为的信号。本文开发了一种基于“拼接”短时傅里叶变换（SST-QSTFT）的框架，通过使用多个窗口适应不同时间-频率区域的信号行为。这促使我们引入基于相位谱有限差分的离散重新分配频率公式，以确保更广泛窗口的计算准确性。我们建立了SST-QSTFT在连续和离散设置中的理论框架，并描述了根据感兴趣区域自动选择最佳窗口的算法。通过合成数据，我们展示了SST-QSTFT在噪声环境下的优越数值性能。最后，我们应用SST-QSTFT于动物叫声录音，以证明该方法在分析真实生物声学信号中的潜力。

英文摘要

In recent years, the synchrosqueezing transform (SST) has gained popularity as a method for the analysis of signals that can be broken down into multiple components determined by instantaneous amplitudes and phases. One such version of SST, based on the short-time Fourier transform (STFT), enables the sharpening of instantaneous frequency (IF) information derived from the STFT, as well as the separation of amplitude-phase components corresponding to distinct IF curves. However, this SST is limited by the time-frequency resolution of the underlying window function, and may not resolve signals exhibiting diverse time-frequency behaviors with sufficient accuracy. In this work, we develop a framework for an SST based on a "quilted" short-time Fourier transform (SST-QSTFT), which allows adaptation to signal behavior in separate time-frequency regions through the use of multiple windows. This motivates us to introduce a discrete reassignment frequency formula based on a finite difference of the phase spectrum, ensuring computational accuracy for a wider variety of windows. We develop a theoretical framework for the SST-QSTFT in both the continuous and the discrete settings, and describe an algorithm for the automatic selection of optimal windows depending on the region of interest. Using synthetic data, we demonstrate the superior numerical performance of SST-QSTFT relative to other SST methods in a noisy context. Finally, we apply SST-QSTFT to audio recordings of animal calls to demonstrate the potential of our method for the analysis of real bioacoustic signals.
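
The idea of reassigning frequencies from a finite difference of the phase spectrum can be illustrated with a minimal sketch. The abstract does not state the paper's formula, so the code below assumes a generic one-sample phase difference on a hop-1 STFT with a single window (no quilting). The names `centered_stft`, `phase_difference_frequency`, and `synchrosqueeze` are hypothetical and are not taken from the paper.

```python
import numpy as np

def centered_stft(x, window, nfft):
    """Hop-1 STFT; frame t covers x[t : t + len(window)], phase referenced to the window center."""
    L = len(window)
    frames = np.lib.stride_tricks.sliding_window_view(x, L) * window
    center_shift = np.exp(2j * np.pi * np.arange(nfft) * (L // 2) / nfft)
    return np.fft.fft(frames, n=nfft, axis=1) * center_shift

def phase_difference_frequency(V):
    """Frequency estimate (cycles/sample) per bin from a one-sample phase difference."""
    return np.angle(V[1:] * np.conj(V[:-1])) / (2.0 * np.pi)

def synchrosqueeze(V, freq, rel_threshold=1e-6):
    """Add each significant STFT coefficient to the bin nearest its frequency estimate."""
    n_frames, nfft = freq.shape
    target = np.rint(freq * nfft).astype(int) % nfft
    keep = np.abs(V[:n_frames]) > rel_threshold * np.abs(V).max()
    S = np.zeros((n_frames, nfft), dtype=complex)
    for t in range(n_frames):
        np.add.at(S[t], target[t, keep[t]], V[t, keep[t]])
    return S

# Assumed test signal: a complex tone at 0.1 cycles/sample,
# which falls between bins 12 and 13 of a 128-point FFT.
f0, nfft = 0.1, 128
window = np.hanning(64)
x = np.exp(2j * np.pi * f0 * np.arange(512))
V = centered_stft(x, window, nfft)
freq = phase_difference_frequency(V)
S = synchrosqueeze(V, freq)
peak_bin = int(np.argmax(np.abs(S[0])))   # bin that collects the squeezed energy
```

In this sketch the STFT spreads the tone over many bins, while the squeezed representation `S` concentrates it on the single bin nearest `f0`. Because the phase is referenced to the window center, summing a squeezed frame over frequency recovers the windowed sample at that center.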

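1603.04179 2026-06-04 cs.SD cs.IT cs.SY eess.SY math.IT Version update

Performance Analysis of Source Image Estimators in Blind Source Separation

Zbyněk Koldovský, Francesco Nesta

AI summary: This paper analyzes two widely used methods for computing sensor responses, and examines their equivalence when the separated signal subspaces are orthogonal as well as how their results differ in applications.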

Comments 24 pages

Details

DOI: 10.1109/TSP.2017.2709269

Abstract

Blind methods often separate or identify signals or signal subspaces up to an unknown scaling factor. Sometimes it is necessary to cope with the scaling ambiguity, which can be done by reconstructing signals as they are received by sensors, because scales of the sensor responses (images) have known physical interpretations. In this paper, we analyze two approaches that are widely used for computing the sensor responses, especially in Frequency-Domain Independent Component Analysis. One approach is the least-squares projection, while the other one assumes a regular mixing matrix and computes its inverse. Both estimators are invariant to the unknown scaling. Although they are frequently used, their differences have not yet been studied. A goal of this work is to fill this gap. The estimators are compared through a theoretical study, perturbation analysis and simulations. We point to the fact that the estimators are equivalent when the separated signal subspaces are orthogonal, and vice versa. Two applications are shown, one of which demonstrates a case where the estimators yield substantially different results.
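
The two estimators named in the abstract can be compared numerically with a small sketch. It assumes a square, real-valued, determined mixture and a textbook reading of "least-squares projection" and "inverse of the mixing matrix"; the function names and the test data are hypothetical and are not taken from the paper.

```python
import numpy as np

def image_via_inverse(W, X, k):
    """Sensor image of source k from the k-th column of the inverse of the demixing matrix W."""
    A = np.linalg.inv(W)
    return np.outer(A[:, k], W[k] @ X)

def image_via_least_squares(W, X, k):
    """Sensor image of source k by least-squares projection of X onto the separated signal."""
    s = W[k] @ X                           # separated signal, shape (T,)
    a = (X @ s.conj()) / np.vdot(s, s)     # minimizer of ||X - outer(a, s)||_F
    return np.outer(a, s)

rng = np.random.default_rng(0)
# Assumed data: three sensors observing a random mix of three Laplacian sources.
X = rng.standard_normal((3, 3)) @ rng.laplace(size=(3, 2000))

# A generic demixing matrix: its outputs are correlated.
W_generic = rng.standard_normal((3, 3))

# A demixing matrix whose outputs are mutually orthogonal (eigenvectors of X X^T).
_, E = np.linalg.eigh(X @ X.T)
W_orth = E.T

# Rescaling the rows of W mimics the unknown scaling of the separated signals.
W_rescaled = np.diag([2.0, -0.3, 7.0]) @ W_generic
```

With `W_orth` both functions return the same image, with `W_generic` they differ, and both are unchanged when `W_generic` is replaced by `W_rescaled`. This matches the equivalence and invariance statements in the abstract, under the assumptions above.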

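1606.07729 2026-06-04 eess.SY cs.SD cs.SY Version update

On Lossless Feedback Delay Networks

Sebastian J. Schlecht, Emanuel A. P. Habets

AI summary: This paper studies general feedback matrices for lossless feedback delay networks and proves that the network is lossless for any delay lengths when every irreducible component of the feedback matrix is diagonally similar to a unitary matrix.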
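
The lossless condition can be checked numerically with a minimal sketch, assuming a real-valued network with no input or output paths. It builds a feedback matrix that is diagonally similar to an orthogonal matrix but is not itself orthogonal, and then inspects the poles of the resulting network. The helper `fdn_state_matrix` and the chosen delays are hypothetical.

```python
import numpy as np

def fdn_state_matrix(A, delays):
    """State-transition matrix of a feedback delay network without input.

    Delay line i is a shift register of length delays[i]; its input is
    sum_j A[i, j] * (output of delay line j).
    """
    delays = np.asarray(delays)
    starts = np.concatenate(([0], np.cumsum(delays)[:-1]))
    ends = starts + delays - 1
    T = np.zeros((delays.sum(), delays.sum()))
    for i, (s, m) in enumerate(zip(starts, delays)):
        T[s, ends] = A[i]               # feedback into the head of line i
        for k in range(1, m):
            T[s + k, s + k - 1] = 1.0   # shift within line i
    return T

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # orthogonal, i.e. real unitary
D = np.diag([0.5, 1.0, 3.0])
A = np.linalg.inv(D) @ U @ D                       # diagonally similar to U

poles = np.linalg.eigvals(fdn_state_matrix(A, [2, 3, 5]))
pole_radii = np.abs(poles)                         # lossless means all radii equal 1
```

Changing the delay lengths in this sketch leaves every pole on the unit circle, which is the behavior the summary describes.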

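1605.01805 2026-06-04 physics.data-an cs.NA cs.SD math.NA Version update

Wave-shape function analysis -- when cepstrum meets time-frequency analysis

Chen-Yun Lin, Li Su, Hau-tieng Wu

AI summary: This paper combines the cepstrum with nonlinear time-frequency analysis to study multicomponent oscillatory signals with time-varying frequency, time-varying amplitude, and non-sinusoidal oscillation patterns. It proposes the de-shape synchrosqueezing transform algorithm, analyzes it theoretically, and validates its effectiveness.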

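1606.00785 2026-06-04 cs.SD cs.NA math.NA Version update

Piano Transcription in the Studio Using an Extensible Alternating Directions Framework

Sebastian Ewert, Mark Sandler

AI summary: This paper proposes a new signal model built on an extensible alternating direction method of multipliers (ADMM) framework. By exploiting single-note recordings of the instrument in use, it improves transcription accuracy for pitched percussive instruments such as the piano, reaching an f-measure of 93-95% for note onsets.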

Comments IEEE/ACM Transactions on Audio, Speech, and Language Processing

Details

DOI: 10.1109/TASLP.2016.2593801
Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1983-1997, 2016

Abstract

Given a musical audio recording, the goal of automatic music transcription is to determine a score-like representation of the piece underlying the recording. Despite significant interest within the research community, several studies have reported on a 'glass ceiling' effect, an apparent limit on the transcription accuracy that current methods seem incapable of overcoming. In this paper, we explore how much this effect can be mitigated by focusing on a specific instrument class and making use of additional information on the recording conditions available in studio or home recording scenarios. In particular, exploiting the availability of single note recordings for the instrument in use we develop a novel signal model employing variable-length spectro-temporal patterns as its central building blocks - tailored for pitched percussive instruments such as the piano. Temporal dependencies between spectral templates are modeled, resembling characteristics of factorial scaled hidden Markov models (FS-HMM) and other methods combining Non-Negative Matrix Factorization with Markov processes. In contrast to FS-HMMs, our parameter estimation is developed in a global, relaxed form within the extensible alternating direction method of multipliers (ADMM) framework, which enables the systematic combination of basic regularizers propagating sparsity and local stationarity in note activity with more complex regularizers imposing temporal semantics. The proposed method achieves an f-measure of 93-95% for note onsets on pieces recorded on a Yamaha Disklavier (MAPS DB).

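1606.09047 2026-06-04 cs.MM cs.NA cs.SD math.NA Version update

Minimum-latency Time-frequency Analysis Using Asymmetric Window Functions

Li Su, Hau-tieng Wu

AI summary: This paper proposes a time-frequency analysis method based on asymmetric window functions for extracting real-time dynamics from time series with minimum latency. It shows theoretically that combining SST with asymmetric windows reduces the inherent latency, and validates the algorithm on musical onset detection.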

Comments 29 pages, 7 figures

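1606.06154 2026-06-04 cs.CE cs.SD cs.SY eess.SY Version update

Closed Form Fractional Integration and Differentiation via Real Exponentially Spaced Pole-Zero Pairs

Julius Orion Smith, Harrison Freeman Smith

AI summary: This paper gives closed-form expressions for fractional integration/differentiation filters built from real, exponentially spaced pole-zero pairs. The filters achieve any desired log-log slope to a controllable degree of accuracy, and the slope is set by sliding the zeros relative to the poles.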

Comments 10 pages, 8 figures

Details

Abstract

We derive closed-form expressions for the poles and zeros of approximate fractional integrator/differentiator filters, which correspond to spectral roll-off filters having any desired log-log slope to a controllable degree of accuracy over any bandwidth. The filters can be described as a uniform exponential distribution of poles along the negative real axis of the s-plane, with zeros interleaving them. Arbitrary spectral slopes are obtained by sliding the array of zeros relative to the array of poles, where each array maintains periodic spacing on a log scale. The nature of the slope approximation is close to Chebyshev optimal in the interior of the pole-zero array, approaching conjectured Chebyshev optimality over all frequencies in the limit as the order approaches infinity. Practical designs can arbitrarily approach the equal-ripple approximation by enlarging the pole-zero array band beyond the desired frequency band. The spectral roll-off slope can be robustly modulated in real time by varying only the zeros controlled by one slope parameter. Software implementations are provided in MATLAB and Faust.
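
The sliding-zeros construction can be sketched numerically. The parametrization below is an assumption made for illustration and is not the paper's closed form: poles sit at `p_min * ratio**k`, and each zero is its pole shifted by the factor `ratio**(-alpha)`, so that the average log-log slope inside the array is `alpha` for `-1 <= alpha <= 1`. The function names are hypothetical.

```python
import numpy as np

def fractional_pole_zero(alpha, p_min, ratio, n_pairs):
    """Break frequencies (rad/s) of real poles and zeros for a target log-log slope alpha.

    Poles sit at p_min * ratio**k; every zero is its pole shifted by
    ratio**(-alpha) on the log-frequency axis, with -1 <= alpha <= 1.
    """
    poles = p_min * ratio ** np.arange(n_pairs)
    zeros = poles * ratio ** (-alpha)
    return poles, zeros

def magnitude(omega, poles, zeros):
    """|H(j*omega)| for H(s) = prod_k (s + zeros[k]) / (s + poles[k])."""
    s = 1j * np.asarray(omega)[:, None]
    return np.abs(np.prod((s + zeros) / (s + poles), axis=1))

# Half-order integrator with one pole-zero pair per octave (assumed design values).
alpha = -0.5
poles, zeros = fractional_pole_zero(alpha, p_min=1.0, ratio=2.0, n_pairs=24)

# Fit the slope well inside the pole-zero array, away from both ends.
omega = np.logspace(np.log10(2.0 ** 6), np.log10(2.0 ** 16), 400)
fitted_slope = np.polyfit(np.log(omega), np.log(magnitude(omega, poles, zeros)), 1)[0]
```

Only `zeros` depends on `alpha` in this sketch, which mirrors the abstract's remark that the slope can be modulated by varying the zeros alone.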

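1602.08609 2026-06-04 cs.SD cs.SY eess.SY Version update

A New Robust Frequency Domain Echo Canceller With Closed-Loop Learning Rate Adaptation

Jean-Marc Valin, Iain B. Collings

AI summary: This paper proposes a closed-loop method for a multidelay block frequency-domain echo canceller in which the learning rate is proportional to a misalignment parameter. The method improves echo cancellation performance and outperforms existing double-talk detection techniques by up to 6 dB.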

Comments 4 pages, Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007

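1602.08116 2026-06-04 eess.SY cs.SD cs.SY Version update

Interference-Normalised Least Mean Square Algorithm

Jean-Marc Valin, Iain B. Collings

AI summary: This paper proposes an interference-normalised least mean square algorithm for robust adaptive filtering. It extends the gradient-adaptive learning rate approach to non-stationary signals and performs particularly well when the interference is highly non-stationary.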

Comments 4 pages

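1602.08044 2026-06-04 cs.SD cs.SY eess.SY Version update

On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk

Jean-Marc Valin

AI summary: This paper proposes a new method for adjusting the learning rate in frequency-domain echo cancellation in response to double-talk and echo path changes. The method is based on a derivation of the optimal learning rate of the NLMS algorithm in the presence of noise. It is evaluated with a multidelay block frequency-domain adaptive filter and is shown to outperform existing double-talk detection techniques while being easy to implement.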

Comments 5 pages
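
Why the learning rate should depend on the conditions can be seen in a small sketch. It uses a plain time-domain NLMS filter with a fixed learning rate `mu`, not the frequency-domain filter or the adaptation rule of the paper. The signals, filter length, and noise level are assumptions chosen for illustration, and white noise stands in for near-end speech.

```python
import numpy as np

def nlms_misalignment(mu, far_end, near_end, echo_path, delta=1e-6):
    """Run time-domain NLMS and return ||w - h||^2 / ||h||^2 after each sample."""
    L = len(echo_path)
    w = np.zeros(L)
    x_buf = np.zeros(L)
    history = np.empty(len(far_end))
    for n, x_n in enumerate(far_end):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x_n
        d = echo_path @ x_buf + near_end[n]   # microphone = echo + near-end signal
        e = d - w @ x_buf                     # error after echo cancellation
        w += mu * e * x_buf / (x_buf @ x_buf + delta)
        history[n] = np.sum((w - echo_path) ** 2) / np.sum(echo_path ** 2)
    return history

rng = np.random.default_rng(2)
h = rng.standard_normal(8)                    # assumed echo path
x = rng.standard_normal(4000)                 # far-end signal
quiet = np.zeros(4000)                        # no near-end signal
double_talk = 0.5 * rng.standard_normal(4000) # near-end activity, modelled as noise

fast_quiet = nlms_misalignment(1.0, x, quiet, h)
fast_noisy = nlms_misalignment(1.0, x, double_talk, h)
slow_noisy = nlms_misalignment(0.05, x, double_talk, h)
```

In this sketch a large learning rate converges quickly when the near end is silent, but a small one reaches a lower steady-state misalignment when a near-end signal is present. No single fixed value suits both conditions.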

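1510.05073 2026-06-04 eess.SY cs.SD cs.SY Version update

Block Sparse Memory Improved Proportionate Affine Projection Sign Algorithm

Jianming Liu, Steven L. Grant

AI summary: This paper proposes an improved block-sparse proportionate affine projection sign algorithm for block-sparse system identification under impulsive noise, improving robustness and efficiency.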

Comments 2 pages, accepted by Electronics Letters

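1408.2294 2026-06-04 eess.SY cs.SD cs.SY Version update

Digital Filter Designs for Recursive Frequency Analysis

Hugh L. Kennedy

AI summary: This paper examines digital filters for recursively computing the discrete Fourier transform and estimating the frequency spectrum of sampled signals, with an emphasis on magnitude response and numerical stability. It proposes new enhancements and discusses the stability issues of FIR and IIR filters.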

Comments To appear in Journal of Circuits, Systems, and Computers (JCSC). Accepted draft version, Aug. 2015. Added summary tables. Expanded Conclusion and Summary Section. Fixed a few errors/typos

Details

Abstract

Digital filters for recursively computing the discrete Fourier transform (DFT) and estimating the frequency spectrum of sampled signals are examined, with an emphasis on magnitude-response and numerical stability. In this tutorial-style treatment, existing recursive techniques are reviewed, explained and compared within a coherent framework; some fresh insights are provided and new enhancements/modifications are proposed. It is shown that the replacement of resonators by (non-recursive) modulators in sliding DFT (SDFT) analyzers with either a finite impulse response (FIR), or an infinite impulse response (IIR), does improve performance somewhat; however stability is not guaranteed, as the cancellation of marginally stable poles by zeros is still involved. The FIR deadbeat observer is shown to be more reliable than the SDFT methods, an IIR variant is presented, and ways of fine-tuning its response are discussed. A novel technique for stabilizing IIR SDFT analyzers with a fading memory, so that all poles are inside the unit circle, is also derived. Slepian and sum-of-cosine windows are adapted to improve the frequency responses for the various FIR and IIR DFT methods.
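
The recursive computation that the abstract refers to can be sketched with the textbook resonator form of the sliding DFT. This is a minimal illustration and not one of the paper's enhanced designs; the function name `sliding_dft` is hypothetical.

```python
import numpy as np

def sliding_dft(x, N):
    """Recursively track all N DFT bins of the most recent N samples of x."""
    twiddle = np.exp(2j * np.pi * np.arange(N) / N)
    X = np.zeros(N, dtype=complex)
    window = np.zeros(N, dtype=complex)      # circular buffer of the last N samples
    for n, sample in enumerate(x):
        oldest = window[n % N]               # sample leaving the window (zero at start-up)
        X = (X - oldest + sample) * twiddle  # one update per bin; poles lie on the unit circle
        window[n % N] = sample
    return X

rng = np.random.default_rng(3)
x = rng.standard_normal(300)
bins = sliding_dft(x, 32)
reference = np.fft.fft(x[-32:])              # direct DFT of the final window
```

Each update costs one complex multiplication per bin instead of a full transform. Because the poles sit exactly on the unit circle, rounding errors in `X` do not decay over time, which is the marginal-stability concern the abstract raises.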
