arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday

1. Speech Synthesis and Sound Generation (6 papers)

2606.18323 2026-06-18 cs.SD cs.LG New submission

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

Ali Asaria, Tony Salomone, Deep Gandhi

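Affiliations: Transformer Lab

AI summary: Targets the stochastic catastrophic failures of open autoregressive neural-codec TTS models (silence, early termination, repetition, or hallucination). Proposes a format-robust metric based on an ASR round-trip, drives the failure rate to near zero with best-of-N self-verification, and transfers the robustness to single-shot decoding through distillation, closing about 52-58% of the failures on hard inputs at no test-time cost.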

Abstract

Open autoregressive neural-codec text-to-speech (TTS) models sound excellent on typical inputs yet suffer stochastic catastrophic failures: on a meaningful fraction of utterances they emit silence, terminate early, or collapse into repetitive or hallucinated content. We show this failure mode is cheap to remove. Under a single format-robust metric (a catastrophic-failure rate via an ASR round-trip), best-of-N ASR self-verification drives failures to near-zero: no observed failures remain by N=2 on a standard corpus (LibriSpeech) and by N=4 on a hard prompt set. This is not an artifact of one model: the reduction replicates across four open codec-TTS systems and three neural codecs (XCodec2, SNAC, Mimi), reaching the near-zero floor by N=2 on three of the four. We then make the fix free at inference time by distilling the self-verified behaviour into the model, which recovers much of the robustness in single-shot decoding, closing ~52-58% of the failure mass on hard inputs at no test-time cost. The distillation gain concentrates where it is needed (hard inputs); on already-reliable prose there is no headroom and no detectable change. A controlled comparison adds a clean negative: offline direct preference optimization (DPO/IPO) does not beat plain supervised distillation, and an online iterative variant is promising but not statistically separable at our evaluation size. We report honestly the one model that resists (a larger Llasa where scale did not obviously help) and a rare-word capability ceiling that no self-distillation method overcomes.
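The best-of-N ASR self-verification loop described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `synthesize` and `transcribe` callables, the word-error-rate check, and the 0.5 failure threshold are all hypothetical, since the abstract does not say how a catastrophic failure is derived from the ASR round-trip.

```python
from typing import Any, Callable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def best_of_n(
    text: str,
    synthesize: Callable[[str, int], Any],  # hypothetical: (text, attempt index) -> audio
    transcribe: Callable[[Any], str],       # hypothetical: audio -> ASR transcript
    n: int = 4,
    fail_wer: float = 0.5,                  # assumed failure threshold, not from the paper
) -> Tuple[Any, float, bool]:
    """Sample up to n candidates and stop at the first that passes the ASR check.

    Returns (audio, wer, is_failure); if no candidate passes, the candidate
    with the lowest WER is returned and is_failure is True.
    """
    best_audio, best_wer = None, float("inf")
    for attempt in range(n):
        audio = synthesize(text, attempt)
        wer = word_error_rate(text, transcribe(audio))
        if wer < best_wer:
            best_audio, best_wer = audio, wer
        if wer < fail_wer:
            break
    return best_audio, best_wer, best_wer >= fail_wer
```

In a distillation setup like the one the abstract outlines, the candidates that pass this check would presumably serve as supervised targets, but the abstract does not give those details.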

2606.18485 2026-06-18 cs.SD cs.AI eess.AS New submission

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

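Affiliations: NVIDIA Corporation

AI summary: Proposes MagpieTTS-LF, an inference-time method that uses soft attention priors, stateful inference, and history-aware text encoding to generate coherent long-form speech without retraining the model.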

Abstract

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

2606.19209 2026-06-18 cs.SD New submission

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Shuoyi Zhou, Yixuan Zhou, Peiji Yang, Yifan Hu, Yicheng Zhong, Zhisheng Wang, Zhiyong Wu

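Affiliations: Shenzhen International Graduate School, Tsinghua University; Inner Mongolia University; Tencent

AI summary: Proposes FineCombo-TTS, a unified framework whose Conditional Flow Matching (CFM)-based Speech Variance Predictor models fine-grained reference-to-target transformations guided by text descriptions, giving flexible and precise control over acoustic attributes.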

Comments Accepted by Interspeech 2026

Abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.

2606.19325 2026-06-18 cs.SD cs.AI cs.CV New submission

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

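Affiliations: Lightricks; Tel Aviv University

AI summary: Proposes ScenA, which conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a natural-language prompt to generate multi-speaker audio scenes. A high-noise-biased timestep distribution addresses the Reference Shortcut problem, and the method outperforms existing systems on speaker-binding metrics on the CoVoMix2-Dialogue benchmark.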

Comments Project page at https://finmickey.github.io/scena/

Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

2509.09631 2026-06-18 cs.SD cs.CL cs.CV Updated version

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

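AI summary: Proposes DiFlow-TTS, a framework that uses discrete flow matching and a Factorized Discrete Flow Denoiser to balance quality and latency in zero-shot TTS.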

Comments Accepted at Interspeech 2026 (Long Paper Track)

Abstract

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

2506.12311 2026-06-18 cs.CL cs.SD eess.AS Updated version

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

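Affiliations: Independent Researcher; Reichman University; Tel Aviv University; Carnegie Mellon University

AI summary: Proposes the Phonikud framework, which addresses underspecified phonetic features such as stress in Hebrew TTS through an open-source G2P system, a corpus, a benchmark, and evaluation models, yielding more accurate phoneme prediction.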

Comments Accepted to Interspeech 2026. Project page: https://phonikud.github.io

Abstract

Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically accurate Hebrew TTS with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified International Phonetic Alphabet (IPA) transcriptions, designed by augmenting a base diacritizer. (2) The ILSpeech corpus of paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models capturing previously disregarded phonetic details for automatic TTS evaluation. Our results show that Phonikud more accurately predicts Hebrew phonemes than prior methods, and that small, local TTS models with phonetic input from Phonikud approach large proprietary systems. We release our code, data, and models at https://phonikud.github.io.

2. Speaker Recognition, Verification, and Separation (2 papers)

2603.10827 2026-06-18 cs.SD cs.AI Updated version

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

Affiliations: Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

AI summary: Proposes a model-agnostic scoring protocol to evaluate the speaker discrimination of speech-aware LLMs (EER > 20%), then injects frozen ECAPA-TDNN speaker embeddings and fine-tunes with LoRA to approach the performance of a dedicated system (EER 1.03%).

Comments 3 Tables, 1 Figure, Published in Interspeech 2026

Abstract

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
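One way to read the scoring protocol in this abstract is sketched below: turn the model's Yes/No token probabilities into a log-likelihood ratio, then compute the equal error rate (EER) over a set of trials. The function names, the `eps` smoothing, and the simple threshold sweep are assumptions made for illustration; the abstract does not describe the prompt, the probability extraction, or the exact EER procedure.

```python
import math
from typing import Sequence


def llr_score(p_yes: float, p_no: float, eps: float = 1e-12) -> float:
    """Continuous verification score: log P(Yes) - log P(No)."""
    return math.log(p_yes + eps) - math.log(p_no + eps)


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Sweep thresholds over the observed scores and return the operating
    point where false-accept and false-reject rates are closest."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A continuous score is what makes an EER computable at all: a bare Yes/No answer yields a single operating point, whereas the ratio can be thresholded anywhere.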

2606.05739 2026-06-18 cs.SD eess.AS Updated version

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

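Affiliations: Keio University, Japan; The University of Tokyo, Japan

AI summary: Compares speaker embeddings from more than 40 speech foundation models against human subjective similarity ratings to test whether model distances agree with human perception, and identifies the configuration factors that most affect this agreement.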

Comments Accepted by INTERSPEECH 2026. Camera-ready version

Abstract

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
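The comparison between model-derived distances and human similarity scores could be set up along the following lines. This is only a sketch: the abstract does not name the distance or the agreement statistic, so cosine distance and Spearman rank correlation are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Under these assumptions, a model whose distances track human judgments would show a strongly negative correlation between distance and rated similarity.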

3. Speech Enhancement, Denoising, and Audio Restoration (2 papers)

2606.18564 2026-06-18 cs.SD eess.SP New submission

Reference-Based Recursive Least-Squares Mitigation of Real Interference in Stereo Audio Recordings

Necati Kagan Erkek, Y. Ugur Ozcan

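Affiliations: Telecommunications Engineering, Department of Electronics, Information

AI summary: For stereo audio contaminated by real train noise and environmental background, applies a multi-reference recursive least-squares (RLS) estimator for adaptive interference cancellation: the interference component is estimated from reference signals and subtracted, followed by a low-pass postfilter. The reference correlation ratio drops by about 30.6 to 34.1 dB.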

Comments 7 pages

Abstract

Reference-based adaptive interference cancellation is evaluated for stereo audio recordings corrupted by real train noise and environmental background. The observed signal is modeled as a clean stereo program contaminated by an additive disturbance generated by an external acoustic source through unknown propagation paths. A second stereo recording, representing another filtered observation of the same physical noise source, is used as the reference input of a multi-reference recursive least-squares (RLS) estimator. The estimated train-interference component is subtracted from the noisy audio and followed by a finite-impulse-response low-pass postfilter. Three 74.01 s real audio sequences sampled at 11.025 kHz are processed under identical algorithmic parameters. Since clean ground truth is not available, performance is assessed with no-reference indicators: waveform behavior, Welch spectral estimates, RMS change, and residual normalized correlation with the reference. With 30 taps per reference channel, 15 anti-causal taps, and forgetting factor 0.999, the maximum reference correlation is reduced from 0.386–0.832 before processing to 0.011–0.016 after processing. The corresponding correlation-ratio reduction is approximately 30.6–34.1 dB, while the output RMS decreases by 1.8–4.8 dB depending on section and stereo channel. The results demonstrate that real train interference, including environmental acoustic effects, can be substantially attenuated when a correlated reference recording is available.
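The core RLS cancellation step can be sketched as below. This is a simplified illustration under assumptions, not the paper's code: it uses a single reference channel (the paper uses a multi-reference estimator), omits the low-pass postfilter, and takes the initial inverse-correlation scale `delta` as a hypothetical value. The defaults for `taps`, `lead` (anti-causal taps), and `lam` (forgetting factor) mirror the numbers quoted in the abstract.

```python
from typing import List, Sequence, Tuple


def rls_cancel(
    primary: Sequence[float],
    reference: Sequence[float],
    taps: int = 30,       # filter length per reference channel (from the abstract)
    lead: int = 15,       # anti-causal taps: how far the filter looks ahead
    lam: float = 0.999,   # forgetting factor (from the abstract)
    delta: float = 100.0, # assumed initial scale of the inverse correlation matrix
) -> Tuple[List[float], List[float]]:
    """Subtract from `primary` the part linearly predictable from `reference`.

    Returns (cleaned signal, final filter weights). Samples where the
    regressor window would run off either end are passed through unchanged.
    """
    w = [0.0] * taps
    P = [[delta if i == j else 0.0 for j in range(taps)] for i in range(taps)]
    out = list(primary)
    for n in range(max(0, taps - 1 - lead), len(primary) - lead):
        # Regressor: reference samples from n + lead back to n + lead - taps + 1.
        x = [reference[n + lead - k] for k in range(taps)]
        Px = [sum(P[i][j] * x[j] for j in range(taps)) for i in range(taps)]
        denom = lam + sum(x[i] * Px[i] for i in range(taps))
        gain = [v / denom for v in Px]
        # A priori error: the primary sample minus the estimated interference.
        err = primary[n] - sum(w[i] * x[i] for i in range(taps))
        w = [w[i] + gain[i] * err for i in range(taps)]
        # P <- (P - gain * x^T P) / lam, using x^T P = (P x)^T for symmetric P.
        P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(taps)]
             for i in range(taps)]
        out[n] = err
    return out, w
```

Each update costs on the order of `taps` squared operations, so this pure-Python version is only practical for short test signals; an array library would be the natural choice for recordings of the length described.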

2606.18611 2026-06-18 cs.SD cs.AI cs.LG stat.ML New submission

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

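Affiliations: The Asahi Shimbun Company; Tokyo Woman's Christian University

AI summary: Proposes the parameter-efficient QC-GAN, which combines a Quaternion Conformer generator with MetricGAN-based training and uses the weight sharing of the Hamilton product to cut parameters. On VoiceBank+DEMAND it reaches PESQ 3.48 with 0.89M parameters, comparable to state-of-the-art models at less than half their size.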

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

Abstract

We propose a parameter-efficient speech enhancement framework, Quaternion Conformer GAN (QC-GAN), which combines a Quaternion Conformer generator with MetricGAN-based training. The Hamilton product encodes the magnitude and phase via structured weight sharing, reducing the number of layer parameters while preserving their interdependencies. A metric-learning discriminator was employed to maximize perceptual quality by optimizing the approximate perceptual evaluation scores. On the VoiceBank+DEMAND dataset, QC-GAN achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 3.48 with only 0.89M parameters, delivering a performance comparable to state-of-the-art models at less than half their size. A 35K-parameter variant achieved a PESQ score of 3.23, surpassing conventional methods with significantly fewer parameters. Evaluation on the DNS-Challenge 3 dataset further confirmed generalization to real-world conditions.
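The structured weight sharing mentioned in this abstract comes from the Hamilton product itself, which the sketch below illustrates. The product formula is standard; how QC-GAN assigns magnitude and phase to quaternion components is not stated in the abstract, so the example makes no assumption about that mapping, and `as_real_matrix` is a helper name introduced here.

```python
from typing import List, Sequence, Tuple

Quaternion = Tuple[float, float, float, float]  # (real, i, j, k)


def hamilton(q: Quaternion, p: Quaternion) -> Quaternion:
    """Hamilton product q * p."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def as_real_matrix(q: Quaternion) -> List[List[float]]:
    """The 4x4 real matrix W such that W @ p equals hamilton(q, p).

    All 16 entries are built from the same 4 numbers, which is the weight
    sharing: a quaternion weight needs 4 parameters where an unconstrained
    4x4 real map would need 16.
    """
    a, b, c, d = q
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]


def matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product, used to check the equivalence above."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]
```

Because each output component mixes all four input components, the four channels stay coupled even though the parameter count drops by a factor of four per weight.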

4. Audio Event Detection and Scene Understanding (2 papers)

2606.18664 2026-06-18 cs.SD cs.AI New submission

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

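Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University

AI summary: Proposes NeuralMUSIC, a hybrid framework that combines a neural network estimating the spatial covariance matrix with the classical MUSIC subspace method, and uses Frequency Attention Fusion and self-supervised learning to improve the robustness and cross-domain generalization of robot sound source localization.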

Abstract

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.

2606.19269 2026-06-18 cs.SD New submission

Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

Jingwen Zhou, Mingzhe Wang

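Affiliations: Xidian University

AI summary: Systematically compares how scoring backends and temporal pooling affect domain-shift robustness in training-free anomalous sound detection. Backend choice (cosine distance, Mahalanobis distance, and others) dominates: it moves target-domain AUC by 13.8 points on average, versus only 3.2 points for pooling. Also proposes a label-free score fusion method.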

Abstract

Training-free anomalous sound detection (ASD) scores a test clip against a memory bank of normal embeddings from a frozen pretrained audio encoder. Recent work attributes domain-shift robustness mainly to how frame-level features are pooled over time; the scoring backend applied on top of the pooled embedding has received far less systematic attention. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (all seven machine types), we cross four classical backends (nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, and PCA-subspace reconstruction residual) with three temporal poolings (mean, GeM, max). Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing). Exploiting this, we propose a label-free score fusion that z-normalizes each backend with its training-bank self-scores and takes the minimum; it reaches a harmonic-mean target AUC of 63.3% versus 64.4% for the per-machine oracle, surpassing every fixed single backend while preserving source-domain accuracy. We also report a negative result: selecting a backend by source-domain pseudo-validation with proxy outliers fails, because all backends saturate on the proxy task.
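The label-free fusion rule stated in this abstract (z-normalize each backend with its training-bank self-scores, then take the minimum) might look roughly like this. The dictionary layout and function names are assumptions, as is the convention that a higher score means more anomalous; the abstract does not say how the self-scores are computed (for example, whether each bank item is scored leave-one-out).

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def z_normalize(score: float, bank_self_scores: Sequence[float]) -> float:
    """Express a backend's raw score in units of its own training-bank spread."""
    mu = mean(bank_self_scores)
    sigma = pstdev(bank_self_scores)
    return (score - mu) / (sigma if sigma > 0 else 1.0)


def fuse_min(test_scores: Dict[str, float],
             bank_self_scores: Dict[str, Sequence[float]]) -> float:
    """Fused anomaly score: the smallest z-normalized score across backends.

    Both dictionaries are keyed by backend name (hypothetical layout).
    """
    return min(
        z_normalize(test_scores[name], bank_self_scores[name])
        for name in test_scores
    )
```

The normalization puts backends with very different raw scales (a cosine distance versus a Mahalanobis distance, say) on a common footing without any anomaly labels, which is what allows a simple minimum to be taken across them.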

5. Music Information Retrieval and Music Generation (2 papers)

2606.18790 2026-06-18 cs.SD cs.AI cs.LG New submission

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

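Affiliations: Athens University of Economics and Business; Orfium Research; Hellenic Mediterranean University; Archimedes / Athena Research Center

AI summary: Proposes an inference-time activation steering framework based on PID feedback control. It extracts latent directions for pitch and duration with the difference-in-means method and decouples multi-attribute steering with Gram-Schmidt orthogonalization, enabling fine-grained, interpretable attribute modulation in symbolic music generation.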

Comments Accepted at Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio (43rd International Conference on Machine Learning - ICMLMLA26), 4 pages main (11 total), 2 figures

Abstract

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
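The two geometric steps named in this abstract, difference-in-means direction extraction and Gram-Schmidt decoupling of two steering directions, can be illustrated with plain vectors. This is a toy sketch under assumptions: the function names, the choice of which direction is kept fixed, and the additive steering rule are illustrative, and the abstract does not specify the layers, scaling, or feedback control used in the paper.

```python
from typing import List, Sequence

Vector = List[float]


def diff_mean(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Mean activation of the high-attribute set minus that of the low-attribute set."""
    dim = len(positive[0])
    mean_pos = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    mean_neg = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v: Vector, u: Vector) -> Vector:
    """One Gram-Schmidt step: remove from v its projection onto u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(hidden: Vector, directions: Sequence[Vector],
          strengths: Sequence[float]) -> Vector:
    """Add each direction to the hidden state, scaled by its strength."""
    out = list(hidden)
    for direction, alpha in zip(directions, strengths):
        out = [h + alpha * d for h, d in zip(out, direction)]
    return out
```

After orthogonalization, moving along the second direction no longer changes the component along the first, which is the sense in which the two attributes are decoupled compared with adding the raw vectors.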

2606.15088 2026-06-18 cs.SD cs.CL eess.AS Updated version

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

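Affiliations: Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University

AI summary: Proposes the Paired Pathway Controlled Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more readily than knowledge acquired through the audio pathway. The effect is not explained by architectural depth and stems mainly from differences in input representation.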

Abstract

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

6. Speech Translation and Speech Language Models (3 papers)

2606.18924 2026-06-18 cs.SD New submission

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

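Affiliations: School of Electrical Engineering, KAIST

AI summary: Uses mechanistic analysis to expose text-dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and applies back-patching, a training-free intervention, to strengthen audio representations and mitigate text dominance.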

Comments Preprint

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS 交叉投稿

Continuous Audio Thinking for Large Audio Language Models

面向大型音频语言模型的连续音频思考

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出连续音频思考(CoAT)框架,通过专家蒸馏在连续潜在空间中组织声学信息,使音频语言模型在生成响应前利用丰富声学特征,无需额外自回归解码成本,在多个音频任务上提升性能。

Comments Preprint

详情
AI中文摘要

大型音频语言模型(LALMs)在从语音转录到音乐分析等多种音频理解任务中展现了令人印象深刻的能力。然而,由于LALMs通常被训练生成与文本对齐的响应,其隐藏状态逐渐为文本生成而塑造,而非保留声学信息。因此,音频携带的多样化声学内容,如语音细节、韵律、声音事件、情感和音高,在过程中丢失,难以在响应中利用。我们引入了连续音频思考(CoAT),这是一个框架,为音频语言模型配备一个连续的潜在工作空间,用于在响应生成之前组织声学信息,并以音频专家的蒸馏作为依据。在思考空间内,模型可以在生成响应时利用专家蒸馏提供的丰富声学信息。此外,所提出的连续思考块可以在单次预填充中处理,因此CoAT不需要比基线额外的自回归解码成本。在三个LALM(Qwen2-Audio、Qwen2.5-Omni-7B和Audio Flamingo 3)上,涵盖音频推理、音频理解、音乐分类、语音情感和语音转录的广泛基准套件上的性能提升证明了CoAT的有效性。进一步分析证实,辅助监督从思考位置传播到模型的文本响应。

英文摘要

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2508.07375 2026-06-18 cs.CL cs.SD eess.AS 版本更新

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

TurnGuide: 通过动态轮次级文本-语音交错增强有意义的全双工口语交互

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Technologies(华为技术)

AI总结 提出TurnGuide方法,通过动态分割助手语音为对话轮次并交错生成轮次级文本和语音,解决全双工语音语言模型在连续双通道音频中集成离散文本令牌导致的时间对齐问题,显著提升语义连贯性和轮次交互性能。

Comments Interspeech 2026 Long Paper Track

详情
AI中文摘要

全双工语音语言模型(FD-SLMs)是专门的基础模型,旨在通过建模复杂的对话轮次(如打断、反馈和重叠语音)来实现自然的实时口语交互。端到端(e2e)FD-SLMs利用真实世界的双通道对话数据捕捉细微的双说话者对话模式以实现类人交互,但由于语音序列过长和高质量口语对话数据有限,其对话能力往往比纯文本对话有所下降。尽管交错文本-语音生成可以缓解这种退化,但将离散文本令牌集成到连续双通道音频流中可能会破坏流畅交互所需的时间对齐。为了解决这个问题,我们提出了TurnGuide,一种用于e2e FD-SLMs的新型文本-语音交错生成方法,该方法动态地将助手语音分割成对话轮次,并交错生成轮次级文本和语音。这种方法使FD-SLMs能够整合LLMs的语义智能,同时不损害自然的声学流畅性。大量实验表明,TurnGuide不仅显著提升了e2e FD-SLMs生成语义有意义且连贯语音的能力,而且在各种轮次事件上达到了最先进的性能。演示见 https://dreamtheater123.github.io/TurnGuide-Demo/。代码见 https://github.com/dreamtheater123/TurnGuide。

英文摘要

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

7. 多模态音频与视听学习 2 篇

2606.19341 2026-06-18 cs.CV cs.CL cs.SD 交叉投稿

Native Active Perception as Reasoning for Omni-Modal Understanding

原生主动感知作为全模态理解的推理

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Qwen Team, Alibaba Group(阿里巴巴集团Qwen团队)

AI总结 提出OmniAgent,一种基于POMDP迭代观察-思考-行动循环的原生全模态智能体,通过主动感知将推理复杂度与视频时长解耦,在多个基准上达到开源模型最优性能。

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

详情
AI中文摘要

用于长视频理解的被动模型通常依赖于“全看一遍”范式,无论查询难度如何都统一处理帧,导致计算成本随视频时长增长。尽管出现了交互式框架,但它们通常依赖于全局预扫描,其上下文成本仍随视频长度增长。我们提出OmniAgent,第一个原生全模态智能体,将视频理解建模为基于POMDP的迭代观察-思考-行动循环。OmniAgent执行按需动作,选择性地将视听线索提炼到持久文本记忆中,有效将推理复杂度与原始视频时长解耦。为实现这一点,我们引入了(1)智能体监督微调,通过带双阶段质量控制的最佳N轨迹合成来启动原生主动感知;(2)带TAURA(轮次感知自适应不确定性重缩放优势)的智能体强化学习,利用轮次级熵将信用分配引导至关键发现轮次。关键的是,OmniAgent表现出正向测试时缩放,性能随推理轮次增加而提升,验证了主动感知的有效性。在十个基准(如VideoMME、LVBench)上的实验结果表明,OmniAgent在开源模型中达到了最先进性能。值得注意的是,在LVBench上,我们的7B智能体优于规模大10倍的Qwen2.5-VL-72B(50.5% vs. 47.3%)。

英文摘要

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2605.26672 2026-06-18 cs.MM cs.SD 版本更新

Can We Hear from Events? Generating Speech from Event Camera

我们能从事件中听到声音吗?从事件相机生成语音

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

发表机构 * Beijing Technology and Business University(北京工商大学) Xidian University(西安电子科技大学) Tongji University(同济大学) University of Sydney(悉尼大学)

AI总结 提出EventSpeech框架,利用神经形态事件相机的高时间精度解决传统RGB语音生成中的时间粒度不匹配问题,实现情感丰富且抗运动模糊的语音生成。

详情
AI中文摘要

传统的基于RGB的语音生成面临时间粒度不匹配问题,因为固定的相机曝光时间不可避免地模糊了渲染情感语音所需的高频发音瞬态。为了打破这一限制,我们提出EventSpeech,这是一个新颖的文本条件框架,率先利用神经形态事件进行表达性语音生成,因为这些微秒级精确的事件自然与声学波形动态对齐。我们的架构集成了一个专用的事件编码器来建模稀疏的神经形态事件,以及一个多尺度音频编码器,其中包含分层小波上下文器(HWC)。双向对齐机制无缝地将语言内容和视觉动态与密集的声学特征同步。此外,我们构建了EVT-SPK作为第一个基准,包括大规模合成数据和来自专用神经形态硬件的真实世界记录。大量评估表明,EventSpeech通过保留细粒度情感和抵抗运动模糊,显著优于当前基线,为多模态语音生成建立了新范式。代码和演示可在https://xrfang-0102.github.io/EventSpeechWeb/获取。

英文摘要

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

8. 低资源、多语言与方言语音 2 篇

2606.18659 2026-06-18 cs.SD 新提交

Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings

负责任的ASR:克服窄带和低资源场景下基础模型的挑战

Tejas Godambe, Nutan Choudhary, Sanket Shah, Nagaraj Adiga, Sharath Adavanne

发表机构 * Applied AI(应用人工智能)

AI总结 本文评估了开源和商业基础ASR模型在窄带对话中的表现,针对低资源语言印地语和低资源口音印度英语,发现零样本性能不佳,微调虽有改进但效果因语言和口音而异。

详情
AI中文摘要

全球电话对话通常通过窄带信道进行,且往往是自发和口语化的。本文评估了广泛使用的基础自动语音识别(ASR)模型——包括开源和商业模型——在窄带对话中的性能,针对低资源语言印地语和低资源口音印度英语。我们首先在零样本设置下评估这些模型,发现它们的性能整体上仍不理想。强调了ASR模型在窄带和低资源语言场景中面临的挑战后,我们进一步研究了使用有限真实标注录音对开源模型进行微调的影响。我们的发现表明,虽然微调带来了一些改进,但其效果因语言和口音而异,很大程度上受预训练期间遇到的数据量影响。

英文摘要

Telephony conversations worldwide are conducted over narrow-band channels and are often spontaneous and colloquial in nature. This paper evaluates the performance of widely used foundational automatic speech recognition (ASR) models, both open-source and commercial, on narrow-band conversations in Hindi, a low-resource language, and Indian-accented English, a low-resource accent. We first assess these models in a zero-shot setting and find that their performance remains suboptimal across the board. Highlighting the challenges faced by ASR models in narrow-band and low-resource language scenarios, we further investigate the impact of fine-tuning open-source models using a limited set of real-life annotated recordings. Our findings indicate that while fine-tuning provides some improvements, its effectiveness varies across languages and accents, largely influenced by the amount of data encountered during pretraining.

2603.06310 2026-06-18 eess.AS cs.CL cs.SD 版本更新

Continual Adaptation for Pacific Indigenous Speech Recognition

太平洋土著语音识别的持续适应

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang

发表机构 * The University of Melbourne(墨尔本大学) UNSW Sydney(新南威尔士大学悉尼分校)

AI总结 针对太平洋土著语言数据稀缺和灾难性遗忘问题,研究语音基础模型的适应策略,发现LoRA在顺序学习中会灾难性遗忘,需定制鲁棒适应方法。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

语音基础模型在处理资源匮乏的太平洋土著语言时面临严重的数据稀缺问题。此外,完全微调存在灾难性遗忘的风险。为弥补这一空白,我们提出了一项实证研究,将模型适应到真实的太平洋数据集。我们研究了数据量、适应策略和表征漂移对多种太平洋语言语音基础模型的影响。此外,我们分析了一个用于顺序语言习得的持续学习框架。跨三种不同的太平洋土著语言的实证结果表明,适应这些语言距离较远的语言会引发严重的内部表征漂移。因此,这些模型面临严格的可塑性与稳定性困境。虽然LoRA初始适应良好,但在顺序学习过程中会出现灾难性遗忘。最终,本研究强调了为代表性不足的语言定制鲁棒适应策略的迫切需求。

英文摘要

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate the impact of data volume, adaptation strategies, and representational drift on speech foundation models for various Pacific languages. Additionally, we analyze a continual learning framework for sequential language acquisition. Empirical results across three distinct Pacific Indigenous languages demonstrate that adapting to these linguistically distant languages induces severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.
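For readers unfamiliar with LoRA, which the abstract above evaluates as an adaptation strategy, the following minimal sketch shows the standard low-rank update: the frozen weight `W` is combined with a trainable product `B @ A` scaled by `alpha / r`. The helper names `matmul` and `lora_weight` and the toy matrices are illustrative assumptions and are unrelated to the authors' code or to any particular library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product (a is n x k, b is k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank.

    w:      d_out x d_in frozen pretrained weight
    lora_b: d_out x r    trainable factor (commonly initialised to zeros)
    lora_a: r x d_in     trainable factor
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


base = [[1.0, 0.0], [0.0, 1.0]]
a_factor = [[0.5, 0.5]]             # rank r = 1
b_zero = [[0.0], [0.0]]             # zero init: adapted model equals base
b_trained = [[1.0], [2.0]]          # made-up values after "training"

unchanged = lora_weight(base, a_factor, b_zero, alpha=1.0)
adapted = lora_weight(base, a_factor, b_trained, alpha=1.0)
```

Because only `A` and `B` are trained, sequentially adapting the same factors to one language after another overwrites what they encoded before, which is one way to read the forgetting behaviour the abstract reports.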

9. 数据集、基准与评测 1 篇

2603.05128 2026-06-18 eess.AS cs.SD 版本更新

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

PolyBench:多声部音频中组合推理的基准测试

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

发表机构 * Harbin University of Science and Technology(哈尔滨理工大学) The University of Melbourne(墨尔本大学) KAIST(韩国科学技术院) University of Surrey(萨里大学)

AI总结 针对多声部音频中组合推理评估缺失的问题,提出PolyBench基准,包含计数、分类、检测、并发和时长估计五个子集,评估发现现有大音频语言模型在多声部场景下性能持续下降。

Comments Accepted by INTERSPEECH 2026

详情
AI中文摘要

大型音频语言模型(LALMs)在音频推理方面能力日益增强,然而现有基准对多声部音频(多个声音事件同时发生并产生组合结构)中的推理覆盖有限。为弥补这一空白,我们引入了PolyBench,这是一个旨在评估多声部音频中组合推理的基准,包含五个评估子集,涵盖计数、分类、检测、并发和时长估计,所有这些都需要对多个并发事件及其关系进行推理。我们对最先进的LALMs的评估揭示了在多声部设置中性能持续下降,表明当前LALMs存在根本性瓶颈。

英文摘要

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio, yet existing benchmarks offer limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. To address this gap, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio, comprising five evaluation subsets that cover counting, classification, detection, concurrency, and duration estimation, all of which require reasoning over multiple concurrent events and their relations. Our evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic settings, indicating a fundamental bottleneck in current LALMs.
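To make the notion of polyphony in the abstract above concrete, here is a small sketch that computes the maximum number of simultaneously active sound events from onset/offset annotations. The `(label, onset, offset)` tuple format and the function name `max_polyphony` are assumptions for illustration; the benchmark's actual annotation format and scoring are not described in the abstract.

```python
from typing import List, Tuple

Event = Tuple[str, float, float]  # (label, onset_seconds, offset_seconds)


def max_polyphony(events: List[Event]) -> int:
    """Return the largest number of events that overlap at any instant.

    Uses a sweep over sorted boundary points. At equal timestamps, offsets
    (-1) are processed before onsets (+1), so an event that ends exactly
    when another begins is not counted as overlapping.
    """
    points = []
    for _label, onset, offset in events:
        points.append((onset, 1))
        points.append((offset, -1))
    points.sort(key=lambda p: (p[0], p[1]))

    active = 0
    best = 0
    for _time, delta in points:
        active += delta
        best = max(best, active)
    return best


scene = [("dog_bark", 0.0, 2.0), ("car_horn", 1.0, 3.0), ("siren", 2.0, 4.0)]
dense_scene = scene + [("speech", 1.5, 2.5)]
```

Here `max_polyphony(scene)` is 2 and `max_polyphony(dense_scene)` is 3; a monophonic clip would give 1.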

10. 安全、隐私与深度伪造音频 3 篇

2606.18738 2026-06-18 cs.SD 新提交

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

GRIDEX:基于网格的深度伪造频谱图取证解释

Thi Ngan Ha Do, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

发表机构 * CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 提出GRIDEX框架,通过两阶段学习(SFT+GRPO)定位频谱图异常区域并生成结构化取证解释,提升伪造检测的可解释性。

详情
AI中文摘要

语音生成技术的进步使得人工语音越来越逼真。尽管现代分类模型在深度伪造检测方面可以达到高准确率,但它们不会产生证据,例如指出欺骗线索在频谱图中的位置及其声学含义,从而限制了它们在取证中的实用性。完整频谱图的人工分析是资源密集型的,因此证据应将注意力集中在最具诊断性的区域。此外,现有的可解释性方法在将上下文属性与局部证据联系起来方面的能力有限,使得解释更难验证。为了克服这一限制,我们提出了GRIDEX,这是一个流水线,当给定深度伪造频谱图时,它会生成其异常的取证解释。该流水线(i)选择频谱图中前K个异常区域,并(ii)为每个异常生成解释。这些解释遵循分类声学字段的模式,包括时间、频谱、语音信息和解释文本。据我们所知,这是第一个使用区域定位为深度伪造频谱图生成结构化取证解释的框架。GRIDEX采用两阶段学习范式进行训练,该范式将监督微调(SFT)与群体相对策略优化(GRPO)相结合。在我们的数据集上的实验表明,与强大的视觉语言模型(VLM)基线相比,伪影定位和解释质量有所提高。数据集和代码将在发表后发布。

英文摘要

The advancement of speech generation technologies has made artificial speech increasingly realistic. Although modern classification models can achieve high accuracy in deepfake detection, they do not produce evidence, such as where spoof cues appear in the spectrogram and what they imply acoustically, which limits their usefulness in forensic settings. Manual analysis of full spectrograms is resource-intensive, so evidence should narrow attention to the most diagnostic regions. Moreover, existing explainability methods have limited capabilities in connecting contextual attributes to localized evidence, making explanations harder to verify. To overcome this limitation, we propose GRIDEX, a pipeline that, when given a deepfake spectrogram, generates forensic explanations of its anomalies. The pipeline (i) selects top-K anomalous regions in the spectrogram and (ii) produces an explanation for each anomaly. The explanations follow a schema of categorical acoustic fields, including temporal, spectral, phonetic information and interpretation text. To our knowledge, this is the first framework to generate structured forensic explanations using regional grounding for deepfake spectrograms. GRIDEX is trained with a two-stage learning paradigm that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Experiments on our dataset show improved artifact localization and explanation quality over strong vision-language model (VLM) baselines. The dataset and code will be released upon publication.

2603.04865 2026-06-18 cs.SD 版本更新

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

首届环境声音深度伪造检测挑战赛:鲁棒性、评估与洞察的基准测试

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

发表机构 * School of Electrical Engineering, KAIST, Daejeon, Republic of Korea(韩国科学技术院电气工程学院,韩国大田) University of Melbourne, Australia(墨尔本大学) Fortemedia Singapore, Singapore(新加坡Fortemedia公司) Xi’an University of Posts & Telecommunications, Xi’an, China(西安邮电大学) Xi'an Lianfeng Acoustic Technologies Co., Ltd., China(西安联丰声学技术有限公司)

AI总结 本文介绍了环境声音深度伪造检测挑战赛,探讨了鲁棒性评估、系统架构及未来研究方向,提出了环境声音深度伪造检测的关键挑战与机遇。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

近年来,音频生成技术的进步使得创建高度逼真的环境声音景观变得更加容易,这可能被滥用于制造欺骗性内容,如假警报、枪声和人群声音,从而引发公众安全和信任的担忧。尽管语音和歌唱声的深度伪造检测已被广泛研究,但环境声音深度伪造检测(ESDD)仍处于探索阶段。为了推动ESDD的发展,首次ESDD挑战赛被启动,吸引了97支注册团队,收到了1748份有效提交。本文提出了该任务的定义、数据集构建、评估协议、基线系统以及挑战赛结果中的关键见解。此外,我们分析了高性能系统中常见的架构选择和训练策略。最后,我们讨论了ESDD的潜在未来研究方向,概述了关键机会和开放问题,以指导该领域后续研究。

英文摘要

Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.

2602.04796 2026-06-18 eess.AS cs.SD 版本更新

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

LALM-as-a-Judge:用于多轮口语对话安全评估的大型音频语言模型基准测试

Amir Ivry, Shinji Watanabe

发表机构 * Computer Engineering, Technion--Israel Institute of Technology, Haifa, Israel(以色列理工学院计算机工程系,以色列海法) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA(卡内基梅隆大学语言技术研究所,美国匹兹堡)

AI总结 针对口语对话中社会不安全内容评估仍以文本为中心、忽略韵律和转录失败的问题,提出包含24000个多轮口语对话的开放基准,评估6种大型音频语言模型在文本、音频和多模态设置下的敏感性、严重性顺序特异性和轮次位置偏差,发现音频提供非词汇证据,多模态增益非普遍且存在多种模式。

Comments Accepted to ICML 2026

详情
AI中文摘要

对口语对话中社会不安全内容的评估仍然以文本为中心,忽略了韵律和转录失败。我们提出了LALM-as-a-Judge,其中包括一个包含24000个多轮口语对话的开放基准,每个对话包含一个局部不安全轮次,这些对话基于8个社会不安全类别和5个严重级别生成。我们评估了6种大型音频语言模型(LALMs)作为评判者,包括开源和闭源模型,在纯文本、纯音频和多模态设置下,针对对话中社会有害内容的敏感性、严重性顺序特异性和轮次位置偏差。结果表明,音频提供了超越转录语义的非词汇证据,并且多模态增益并非普遍存在,而是可以表现为文本锚定、平衡、保守和干扰,我们将这些归因于音频路径瓶颈和融合限制。我们将该基准定位为诊断工具,并为模型、模态和提示选择提供实践者指导。

英文摘要

Evaluation of socially unsafe content in spoken dialogues remains text-centric, missing prosody and transcription failures. We present LALM-as-a-Judge, which includes an open benchmark of 24,000 multi-turn spoken dialogues with one localized unsafe turn, generated from 8 socially unsafe categories and 5 severity levels. We evaluate 6 large audio-language models (LALMs) as judges, open and closed-source, in text-only, audio-only, and multimodal setups by their sensitivity, severity-order specificity, and turn-position bias for socially harmful content in the dialogue. Results show that audio contributes non-lexical evidence beyond transcript semantics and that multimodal gains are not universal but can be text-anchored, balanced, conservative, or interfering, which we link to audio pathway bottlenecks and fusion limits. We position the benchmark as diagnostic and derive practitioner guidance for model, modality, and prompt choices.

11. 其他/综合语音音频 9 篇

2606.18560 2026-06-18 cs.SD 新提交

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

约束泛化:音频-语言模型少样本泛化的子空间微调

Jaehyuk Jang, Kangwook Ko, Wonjun Lee, Changick Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 针对音频-语言模型少样本微调导致的基类-新类权衡问题,提出子空间微调(SubT),通过结构化子空间参数化和残差锚定约束文本嵌入漂移,并利用子空间感知门控抑制负迁移,在11个音频基准上实现高效强泛化。

详情
AI中文摘要

预训练音频-语言模型(ALM)的少样本适应通常以牺牲未见类泛化为代价提高可见类性能,导致基类-新类权衡。我们将此失败归因于文本嵌入空间中的零样本漂移:少样本微调可能扭曲类间结构,并使适应后的嵌入远离其预训练锚点。因此,我们提出子空间微调(SubT),一种几何约束的适应框架,具有两种互补的漂移控制。结构化子空间参数化限制结构变形,残差锚定稳定围绕零样本先验的适应。在推理时,子空间感知门控进一步抑制弱对齐未见类的负迁移。在11个音频基准上,SubT在保持高效的同时实现了强大的少样本泛化,直接对预计算文本嵌入进行操作,无需文本编码器反向传播。

英文摘要

Few-shot adaptation of pretrained Audio-Language Models (ALMs) often improves seen-class performance at the cost of unseen-class generalization, leading to the base-to-new trade-off. We attribute this failure to zero-shot drift in the text embedding space: few-shot tuning can distort inter-class structure and move adapted embeddings far from their pretrained anchors. We therefore propose Subspace Tuning (SubT), a geometry-constrained adaptation framework with two complementary controls on drift. Structured Subspace Parameterization limits structural deformation, and Residual Anchoring stabilizes adaptation around the zero-shot prior. At inference time, Subspace-aware Gating further suppresses negative transfer for weakly aligned unseen classes. Across 11 audio benchmarks, SubT delivers strong few-shot generalization while remaining efficient, operating directly on precomputed text embeddings without text-encoder backpropagation.

2606.18266 2026-06-18 cs.HC cs.AI cs.SD 交叉投稿

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

EMORSION:检验音频参数对电影中情感反应和沉浸感的影响

Nelly Garcia, Ruby Crocker, Bleiz M Del Sette, Fabrizio Smeraldi, Charalampos Saitis, George Fazekas, Joshua Reiss

发表机构 * Queen Mary University of London(伦敦玛丽女王大学)

AI总结 通过操纵频率、动态和方向性三个音频参数,研究电影音频设计对观众情感和沉浸感的影响,发现细微变化可改变情感感知,非常规混音增加解读变异性。

Comments AES Europe 2026

详情
AI中文摘要

EMORSION 是一项探索性概念验证研究,旨在考察电影音频设计如何在影院环境中塑造观众的情感和沉浸感。选取了恐怖片(2部)和剧情片(2部)共四个电影场景,平衡主流与独立制作。针对每个场景,通过系统操纵音频设计的三个核心方面——频率(音高)、动态(响度)和方向性(空间位置),创建了多种替代音频混音。三组观众观看场景,每组观看每个场景的一个操纵混音和一个对照混音。通过三角化多模态框架评估观众反应,包括通过问卷自我报告的情感和沉浸感、心率监测等生理测量以及基于视频的运动追踪。该协议成功捕获了不同音频条件下可测量、可解释的差异,表明即使音频设计的细微变化也能塑造情感感知和沉浸感。非常规混音往往导致观众解读的更大变异性,而常规沉浸式混音则与更强的跨观众一致性相关。这些发现确立了 EMORSION 协议的可行性,并激励更大规模的研究来表征特定音频参数在塑造观众体验中的作用。

英文摘要

EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in a cinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

2606.18480 2026-06-18 eess.AS cs.SD 交叉投稿

Generalised Transcoding Framework for Arbitrary Spatial Audio Capture and Playback Formats

任意空间音频采集与回放格式的通用转码框架

Archontis Politis, Janani Fernandez, Leo McCormack

发表机构 * Faculty of Information Technology and Communication Sciences, Tampere University(坦佩雷大学信息技术与通信科学学院) Department of Information and Communications Engineering, Aalto University(阿尔托大学信息与通信工程系)

AI总结 提出一种统一框架,通过估计时频域空间元数据(包括主成分和环境成分的角功率分布),实现从Ambisonic或原始麦克风阵列信号到任意目标回放格式的转码,支持独立旋转,实验证明其优于现有参数化渲染器。

Comments This work has been submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication

详情
AI中文摘要

本文介绍了一种统一框架,用于对以Ambisonic信号或原始麦克风阵列信号形式捕获的空间声场景进行参数化分析和再现。所提出的方法估计时频相关的空间元数据,该元数据表征可变数量的主源分量和具有自身角功率分布的环境分量,其参数拟合捕获信号的观测空间协方差。该元数据用于构建目标回放格式的空间协方差,然后用于推导最优混合矩阵,以将场景转码用于目标再现系统上的回放。该方法还独立处理采集和回放设置的旋转。在听力测试中,使用来自Ambisonic、球形和头戴式阵列的模拟场景,比较了该方法的实时实现和其他现有的最先进参数化渲染器。结果突出了所提出框架在多种内容和接收器配置下的感知优势,特别是对于低阶和几何约束的麦克风阵列。

英文摘要

This article introduces a unified framework for the parametric analysis and reproduction of spatial sound scenes captured either as Ambisonic signals or as raw microphone array signals. The proposed method estimates time-frequency-dependent spatial metadata that characterises a variable number of primary source components and an ambience component with its own angular power distribution, whose parameters fit the observed spatial covariances of the captured signals. This metadata is used to construct spatial covariances of the target playback formats, which are then used to derive optimal mixing matrices for transcoding the scene for playback over the target reproduction system. The method additionally handles independent rotations of both capture and playback setups. Real-time implementations of the method and other existing state-of-the-art parametric renderers are compared in a listening test using simulated scenes from Ambisonic, spherical, and head-worn arrays. The results highlight perceptual benefits of the proposed framework across a diverse range of content and receiver configurations, particularly for lower-order and geometrically constrained microphone arrays.

2606.18571 2026-06-18 cs.LG cs.CL cs.SD eess.AS 交叉投稿

Fair Cognitive Impairment Detection Through Unlearning

通过去学习实现公平的认知障碍检测

William Nguyen, Jiali Cheng, Hadi Amiri

发表机构 * University of Massachusetts Lowell, USA(马萨诸塞大学洛厄尔分校)

AI总结 提出一种多模态框架,结合跨模态融合和梯度反转去学习,减少人口统计信息对轻度认知障碍检测的偏见,在跨语言数据集上缩小性能差距。

Comments Interspeech 2026

详情
AI中文摘要

轻度认知障碍(MCI)是一种以记忆、语言或思维能力显著下降为特征的医学状况。从自发语音中检测MCI对于可扩展的筛查具有前景。然而,学习模型常常利用与标签相关的人口统计线索,导致不同亚组之间存在较大的性能差距。我们提出了一种多模态框架,结合了(i)模态间(语音、文本和图像)的跨模型融合,以及(ii)使用梯度反转的去学习,该技术阻止共享嵌入编码与任务无关的人口统计属性。在多语言基准TAUKADIAL和PREPARE上的评估表明,我们的方法在MCI分类上优于最先进的多语言和多模态基线,同时显著缩小了患者亚组(性别和语言)之间的性能差距。我们进一步分析了跨数据集的迁移,表明人口统计去学习有助于学习更鲁棒的MCI检测表示。

英文摘要

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.
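The gradient-reversal mechanism mentioned in the abstract above can be sketched on a scalar toy model: the adversary head that predicts a demographic attribute receives its ordinary gradient, while the gradient flowing back into the shared feature is multiplied by a negative factor. The one-weight encoder, the squared-error loss, and the names `adversary_grads` and `lam` are assumptions chosen for brevity; the paper's architecture and loss are not given in the abstract.

```python
from typing import Tuple


def adversary_grads(w: float, v: float, x: float, y: float,
                    lam: float) -> Tuple[float, float]:
    """Gradients of an adversary loss with a gradient-reversal step.

    Forward pass (toy, assumed):
        feature    f = w * x        shared encoder with weight w
        prediction a = v * f        adversary head with weight v
        loss       L = (a - y) ** 2 where y is the demographic label
    Backward pass:
        the head weight v gets the ordinary gradient dL/dv;
        the gradient reaching the encoder is scaled by -lam (reversal),
        so a descent step on w makes the adversary's job harder.
    """
    f = w * x
    a = v * f
    dl_da = 2.0 * (a - y)
    grad_v = dl_da * f                 # ordinary gradient for the adversary
    grad_f = dl_da * v                 # gradient arriving at the feature
    grad_w = (-lam * grad_f) * x       # reversed before entering the encoder
    return grad_v, grad_w


grad_v, grad_w = adversary_grads(w=1.0, v=1.0, x=2.0, y=0.0, lam=0.5)
```

With these numbers the adversary's gradient is 8.0 and the encoder receives -4.0 instead of +4.0, so gradient descent moves the encoder in the direction that increases the adversary's loss.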

2606.18979 2026-06-18 eess.AS cs.CL cs.SD 交叉投稿

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

缓解语音痴呆评估中的评分错误并补偿非语言子测试

Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

AI总结 研究通过融合转录分数和Whisper嵌入减少语音评估中的评分错误,并利用融合表示近似专家整体评分以补偿缺失的运动子测试,有效区分认知状态组。

Comments Accepted at INTERSPEECH 2026

详情
AI中文摘要

认知障碍的早期检测依赖于神经心理学测试,通过评估多个认知领域来最小化主观性。基于语音的评估可以支持诊断并提高可及性,但转录错误和非语言子测试(如运动技能)的遗漏限制了准确性。除了传统的测试分数,语音衍生特征可以提供对认知状态的额外见解。本研究调查了德国“综合征短测试”的语音评估,这是一种标准化的痴呆筛查测试,包含语言和运动子测试。我们训练模型,整合每个语言子测试的转录衍生分数和Whisper嵌入,以减少评分错误。为了补偿缺失的运动子测试,我们利用这些融合表示来近似专家整体评分。尽管省略了子测试,我们的模型与专家评分高度相关,并能有效且准确地区分认知状态组。

英文摘要

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

2606.19039 2026-06-18 cs.NE cs.LG cs.SD 交叉投稿

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

自适应语音到脉冲编码用于脉冲神经网络

Taharim Rahman Anon, Jakaria Islam Emon

发表机构 * PI LLC(1 PI LLC)

AI总结 提出一种可学习的残差语音到脉冲编码器,与R-LIF骨干网络联合训练,在GSC-v2上达94.97%准确率,参数高效且学习任务对齐的脉冲表示。

Comments Accepted at Interspeech 2026. This version is a preprint

详情
AI中文摘要

连续声学信号与离散事件驱动处理之间的不匹配仍然是神经形态语音处理的基本瓶颈。当前系统通常依赖固定的脉冲编码器,迫使下游脉冲神经网络(SNN)补偿非自适应的输入表示。为了解决这个问题,我们提出了一种可学习的残差语音到脉冲编码器,与循环漏积分点火(R-LIF)骨干网络进行端到端联合训练。我们在Google Speech Commands v2(GSC-v2)基准上验证了该方法,达到了高达94.97%的准确率。值得注意的是,学习到的编码器仍然高度参数高效,其紧凑的35k参数变体达到了89.8%,匹配或超过了需要多一个数量级参数的先前基线。我们以编码器为中心的分析,包括线性探测和梯度残差检查,表明编码器并不追求忠实的信号重建,而是学习任务对齐的脉冲表示,增强了类别可分性。最后,我们通过比较直接反馈对齐(DFA)和替代梯度BPTT在相同架构和训练条件下的表现,对生物启发、硬件友好的信用分配进行了基准测试。我们发现DFA达到了91.5%的准确率,量化了生物启发学习规则在现代神经形态音频中的性能权衡。

英文摘要

The mismatch between continuous acoustic signals and discrete event-driven processing remains a fundamental bottleneck for neuromorphic speech processing. Current systems typically rely on fixed spike encoders, forcing downstream Spiking Neural Networks (SNNs) to compensate for non-adaptive input representations. To address this, we present a learnable residual speech-to-spike encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. We validate this approach on the Google Speech Commands v2 (GSC-v2) benchmark, achieving up to 94.97% accuracy. Notably, the learned encoder remains highly parameter-efficient with a compact 35k-parameter variant that reaches 89.8%, matching or exceeding prior baselines that require an order of magnitude more parameters. Our encoder-focused analysis, including linear probing and gradient-residual inspection, indicates that the encoder does not target faithful signal reconstruction but instead learns task-aligned spike representations that enhance class separability. Finally, we benchmark bio-inspired, hardware-friendly credit assignment by comparing Direct Feedback Alignment (DFA) with surrogate-gradient BPTT under identical architectures and training conditions. We find that DFA reaches 91.5% accuracy, quantifying the performance trade-off of bio-inspired learning rules for modern neuromorphic audio.
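As background for the abstract above, the sketch below shows a fixed (non-learned) leaky integrate-and-fire encoder of the kind the paper argues against relying on: a membrane potential leaks, integrates the input, and emits a spike and resets when it crosses a threshold. The function name `lif_encode`, the hard reset to zero, and the parameter values are illustrative assumptions; the paper's learnable residual encoder and R-LIF backbone are not reproduced here.

```python
from typing import List, Sequence


def lif_encode(signal: Sequence[float], leak: float = 0.9,
               threshold: float = 1.0) -> List[int]:
    """Convert a sampled signal into a binary spike train with one LIF neuron.

    At each step the membrane potential decays by `leak`, adds the current
    input, and fires (1) if it reaches `threshold`, after which it is reset
    to zero. Otherwise the output is 0.
    """
    potential = 0.0
    spikes = []
    for sample in signal:
        potential = leak * potential + sample
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0          # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


frame_energy = [0.6, 0.6, 0.0, 1.2]
leaky_spikes = lif_encode(frame_energy, leak=0.5, threshold=1.0)
no_leak_spikes = lif_encode(frame_energy, leak=1.0, threshold=1.0)
```

With a strong leak only the final large input fires (`[0, 0, 0, 1]`), whereas without leak the two moderate inputs accumulate into an extra spike (`[0, 1, 0, 1]`). A learnable encoder would tune such parameters, or replace this rule entirely, jointly with the downstream network.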

2604.18109 2026-06-18 cs.CL cs.SD 版本更新

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

FLiP:理解和解释多模态多语句子嵌入

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出因子化线性投影(FLiP)模型,从多语言、多模态句子嵌入中恢复词汇内容,揭示编码器的模态和语言偏差。

Comments Accepted to Interspeech 2026

AI中文摘要

本文提出了因子化线性投影(FLiP)模型,用于理解预训练句子嵌入空间。我们训练FLiP模型从多语言(LaBSE)、多模态(SONAR)和基于API(Gemini)的句子嵌入空间中恢复多种高资源和中等资源语言的词汇内容。我们表明,FLiP可以从嵌入中召回超过75%的词汇内容,显著优于现有的非因子化基线。使用此作为诊断工具,我们揭示了所选句子编码器的模态和语言偏差,并为从业者提供了关于编码器的内在见解,而无需依赖传统的下游评估任务。我们的实现已公开:https://github.com/BUTSpeechFIT/FLiP。

英文摘要

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is publicly available at https://github.com/BUTSpeechFIT/FLiP.
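The abstract reports recall of lexical content but does not say how it is computed. As a rough illustration only, the sketch below assumes a set-based bag-of-words recall: the fraction of distinct words in the reference sentence that appear among the words predicted from the embedding. The function name `lexical_recall` and the whitespace tokenization are hypothetical choices, not the paper's protocol.

```python
def lexical_recall(predicted_words, reference_sentence):
    """Fraction of distinct reference words that were recovered.

    Assumes lowercase, whitespace-tokenized, set-based matching; the paper's
    actual metric may differ.
    """
    reference = set(reference_sentence.lower().split())
    if not reference:
        return 0.0
    predicted = {word.lower() for word in predicted_words}
    return len(predicted & reference) / len(reference)


if __name__ == "__main__":
    # Two of the five distinct reference words ("the", "cat") are recovered.
    print(lexical_recall(["the", "cat", "dog"], "The cat sat on the mat"))
```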

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS 版本更新

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

发表机构 * Department of Biomedicine and Prevention, University of Rome Tor Vergata(罗马托尔维加塔大学生物医学与预防系); A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)(哈佛医学院/麻省总医院 A.A. Martinos生物医学成像中心,美国波士顿)

Comments The first two authors contributed equally to this work

Journal ref
Neural Networks, 203, 109195 (2026)
英文摘要

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
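The retrieval-based decoding step described above can be sketched as follows: a linear map projects a brain-activity vector into the stimulus feature space, and the candidate stimulus with the highest cosine similarity is retrieved. This is a toy illustration, not the authors' pipeline. The weight matrix here is hand-written rather than learned, the two-dimensional vectors stand in for CLAP features, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_map(weights, activity):
    """Project a brain-activity vector; each row of weights yields one feature."""
    return [sum(w * x for w, x in zip(row, activity)) for row in weights]


def retrieve(predicted, candidates):
    """Index of the candidate feature vector most similar to the prediction."""
    return max(range(len(candidates)), key=lambda i: cosine(predicted, candidates[i]))


def identification_accuracy(predictions, candidates):
    """Share of trials where prediction i retrieves candidate i."""
    hits = sum(1 for i, p in enumerate(predictions) if retrieve(p, candidates) == i)
    return hits / len(predictions)


if __name__ == "__main__":
    weights = [[1, 0, 0], [0, 1, 0]]         # hypothetical 3-voxel to 2-feature map
    candidates = [[1, 0], [0, 1]]            # stand-ins for stimulus features
    activity = [[2, 0.1, 5], [0.2, 3, 1]]    # one activity vector per trial
    predictions = [linear_map(weights, a) for a in activity]
    print(identification_accuracy(predictions, candidates))
```

In practice the map would be fitted on training data, for example by regularized regression, and evaluated on held-out stimuli.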

2206.05018 2026-06-18 cs.SD cs.CL eess.AS 版本更新

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer, Sebastian P. Bayerl

Comments Accepted at the 25th International Conference on Text, Speech and Dialogue (TSD 2022)

Journal ref
Proceedings of the 25th International Conference on Text, Speech, and Dialogue (TSD 2022)
英文摘要

Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological tests, namely the German SKT and a German version of the CERAD-NB, and a semi-structured clinical interview between a patient and a psychologist. For the tests, we focus on speech recordings of three sub-tests: reading numbers (SKT 3), interference (SKT 7), and verbal fluency (CERAD-NB 1). We show that acoustic features from standardized tests can be used to reliably discriminate cognitively impaired individuals from non-impaired ones. Furthermore, we provide evidence that even features extracted from random speech samples of the interview can be a discriminator of cognitive impairment. In our baseline experiments, we use OpenSMILE features and Support Vector Machine classifiers. In an improved setup, we show that using wav2vec 2.0 features instead, we can achieve an accuracy of up to 85%.