2606.19325 2026-06-18 cs.SD cs.AI cs.CV New submission

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

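Affiliations: Lightricks; Tel Aviv University

AI summary: Proposes ScenA, which uses a pretrained text-to-audio flow-matching foundation model to generate multi-speaker audio scenes from multiple reference voices and a natural-language prompt, and adopts a timestep distribution biased toward high noise to address the reference-shortcut problem; it outperforms existing systems on the CoVoMix2-Dialogue benchmark.

Comments Project page at https://finmickey.github.io/scena/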

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

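Affiliations: Keio University, Japan; The University of Tokyo, Japan

AI summary: Compares speaker embeddings from more than 40 speech foundation models against human subjective similarity ratings to test whether model distances agree with human perception, and identifies the key configuration factors that affect model-human agreement.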

Comments Accepted by INTERSPEECH 2026. Camera-ready version

2606.18564 2026-06-18 cs.SD eess.SP New submission

Reference-Based Recursive Least-Squares Mitigation of Real Interference in Stereo Audio Recordings

Necati Kagan Erkek, Y. Ugur Ozcan

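Affiliations: Telecommunications Engineering, Department of Electronics, Information

AI summary: For stereo audio contaminated by real train noise and ambient background, applies a multi-reference recursive least-squares (RLS) estimator for adaptive interference cancellation: the interference component is estimated from reference signals and subtracted, followed by a low-pass post-filter, reducing reference correlation by 30.6--34.1 dB.

Comments 7 pages

AI abstract

A reference-based adaptive interference cancellation method is evaluated for stereo audio recordings contaminated by real train noise and ambient background. The observed signal is modeled as a clean stereo program corrupted by additive interference generated by an external sound source through unknown propagation paths. A second stereo recording, representing another filtered observation of the same physical noise source, is used as the reference input to a multi-reference recursive least-squares (RLS) estimator. The estimated train-interference component is subtracted from the noisy audio, followed by a finite impulse response low-pass post-filter. Three real audio sequences of 74.01 s, sampled at 11.025 kHz, are processed with identical algorithm parameters. Because no clean ground truth is available, performance is assessed with reference-free metrics: waveform behavior, Welch spectral estimates, RMS change, and the normalized correlation between the residual and the reference. With 30 taps per reference channel, 15 anti-causal taps, and a forgetting factor of 0.999, the maximum reference correlation drops from 0.386--0.832 before processing to 0.011--0.016 after processing. The corresponding correlation ratio decreases by about 30.6--34.1 dB, while the output RMS is reduced by 1.8--4.8 dB depending on the clip and stereo channel. The results indicate that real train interference, including ambient acoustic effects, can be substantially attenuated when a correlated reference recording is available.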
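To make the reference-based RLS idea concrete, here is a minimal sketch of adaptive interference cancellation on synthetic data. It is deliberately simplified and is not the paper's implementation: it uses a single reference channel with a few causal taps (the paper reports 30 taps per reference channel with 15 anti-causal taps), and the function name `rls_cancel`, the initialisation constant `delta`, and the toy signals are all assumptions made for illustration. Only the forgetting factor 0.999 is taken from the abstract.

```python
import math
import random


def rls_cancel(primary, reference, taps=4, lam=0.999, delta=100.0):
    """Subtract the part of `primary` that is linearly predictable from `reference`.

    Returns the cleaned signal (the a priori error) and the final filter weights.
    """
    w = [0.0] * taps
    # P tracks the inverse of the weighted reference correlation matrix.
    P = [[delta if i == j else 0.0 for j in range(taps)] for i in range(taps)]
    cleaned = []
    for n in range(len(primary)):
        # Most recent `taps` reference samples, zero-padded at the start.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        Px = [sum(P[i][j] * x[j] for j in range(taps)) for i in range(taps)]
        denom = lam + sum(x[i] * Px[i] for i in range(taps))
        gain = [v / denom for v in Px]
        # A priori error: observed sample minus the estimated interference.
        err = primary[n] - sum(w[i] * x[i] for i in range(taps))
        w = [w[i] + gain[i] * err for i in range(taps)]
        # P <- (P - gain * x^T P) / lam, using the symmetry of P (x^T P = Px^T).
        P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(taps)]
             for i in range(taps)]
        cleaned.append(err)
    return cleaned, w


# Toy data (hypothetical): a quiet sine "program" plus filtered reference noise.
random.seed(0)
n_samples = 2000
ref = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
clean = [0.1 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
interf = [0.8 * ref[n] - 0.3 * (ref[n - 1] if n > 0 else 0.0)
          for n in range(n_samples)]
noisy = [c + i for c, i in zip(clean, interf)]

cleaned, w = rls_cancel(noisy, ref)
```

In this toy setup the learned weights should approach the synthetic propagation path (0.8, -0.3, 0, 0), and `cleaned` should track `clean` once the filter has converged. A multi-reference version would stack the tap vectors of all reference channels into one longer `x`.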

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

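Affiliations: Athens University of Economics and Business; Orfium Research; Hellenic Mediterranean University; Archimedes / Athena Research Center

AI summary: Proposes an inference-time activation steering framework based on PID feedback control; it extracts latent pitch and duration directions with the difference-in-means method and uses Gram-Schmidt orthogonalization to decouple multi-attribute steering, enabling fine-grained, interpretable attribute modulation in symbolic music generation.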

Comments Accepted at Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio (43rd International Conference on Machine Learning - ICMLMLA26), 4 pages main (11 total), 2 figures

Abstract

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
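The two geometric steps named in the abstract, difference-in-means direction extraction and Gram-Schmidt decoupling, can be sketched on toy vectors as follows. This is an illustrative sketch only: the three-dimensional "activations", the helper names, and the steering strength are hypothetical, and nothing here reproduces the paper's model or its PID controller.

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]


def diff_mean(pos, neg):
    """Difference-in-means direction: mean(pos) - mean(neg)."""
    return [a - b for a, b in zip(mean_vec(pos), mean_vec(neg))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its projection onto u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(h, direction, alpha):
    """Add a scaled steering direction to an activation vector."""
    return [a + alpha * d for a, d in zip(h, direction)]


# Hypothetical residual-stream activations for contrasting attribute values.
high_pitch = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
low_pitch = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
long_dur = [[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]
short_dur = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

d_pitch = diff_mean(high_pitch, low_pitch)    # [2.0, 1.0, 0.0]
d_dur = diff_mean(long_dur, short_dur)        # overlaps with d_pitch
d_dur_orth = orthogonalize(d_dur, d_pitch)    # overlap removed

h = [0.5, -1.0, 2.0]
h_steered = steer(h, d_dur_orth, 3.0)
```

After orthogonalization, steering along `d_dur_orth` leaves the projection of the activation onto `d_pitch` unchanged, which is the sense in which the two attributes are decoupled in this sketch.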

2606.15088 2026-06-18 cs.SD cs.CL eess.AS Updated version

Can We Hear from Events? Generating Speech from Event Camera

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

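Affiliations: Beijing Technology and Business University; Xidian University; Tongji University; University of Sydney

AI summary: Proposes EventSpeech, a framework that exploits the high temporal precision of neuromorphic event cameras to resolve the temporal granularity mismatch in conventional RGB-based speech generation, producing emotionally expressive speech that is robust to motion blur.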

Abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

2606.18659 2026-06-18 cs.SD New submission

Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings

Tejas Godambe, Nutan Choudhary, Sanket Shah, Nagaraj Adiga, Sharath Adavanne

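Affiliations: Applied AI

AI summary: Evaluates open-source and commercial foundational ASR models on narrow-band conversational speech for a low-resource language (Hindi) and a low-resource accent (Indian English); zero-shot performance is poor, and fine-tuning helps but the gains vary by language and accent.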

2603.06310 2026-06-18 eess.AS cs.CL cs.SD Updated version

Continual Adaptation for Pacific Indigenous Speech Recognition

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang

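Affiliations: The University of Melbourne; UNSW Sydney

AI summary: Addressing data scarcity and catastrophic forgetting for Pacific Indigenous languages, studies adaptation strategies for speech foundation models and finds that LoRA forgets catastrophically during sequential learning, so tailored robust adaptation methods are needed.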

Comments Accepted by Interspeech 2026

Abstract

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate the impact of data volume, adaptation strategies, and representational drift on speech foundation models for various Pacific languages. Additionally, we analyze a continual learning framework for sequential language acquisition. Empirical results across three distinct Pacific Indigenous languages demonstrate that adapting to these linguistically distant languages induces severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.

2603.05128 2026-06-18 eess.AS cs.SD Updated version

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

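Affiliations: Harbin University of Science and Technology; The University of Melbourne; KAIST; University of Surrey

AI summary: To fill the gap in evaluating compositional reasoning over polyphonic audio, proposes the PolyBench benchmark with five subsets (counting, classification, detection, concurrency, and duration estimation); evaluation shows that existing large audio language models consistently degrade in polyphonic scenes.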

Comments Accepted by INTERSPEECH 2026

2606.18738 2026-06-18 cs.SD New submission

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

Jaehyuk Jang, Kangwook Ko, Wonjun Lee, Changick Kim

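Affiliations: KAIST

AI summary: To address the base-novel trade-off caused by few-shot fine-tuning of audio-language models, proposes Subspace Tuning (SubT), which constrains text-embedding drift through structured subspace parameterization and residual anchoring and suppresses negative transfer with subspace-aware gating, achieving efficient and strong generalization on 11 audio benchmarks.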

2606.18266 2026-06-18 cs.HC cs.AI cs.SD Cross-list

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

Nelly Garcia, Ruby Crocker, Bleiz M Del Sette, Fabrizio Smeraldi, Charalampos Saitis, George Fazekas, Joshua Reiss

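Affiliations: Queen Mary University of London

AI summary: By manipulating three audio parameters (frequency, dynamics, and directionality), studies how film audio design affects audience emotion and immersion; subtle changes can alter emotional perception, and unconventional mixes increase variability in interpretation.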

Comments AES Europe 2026

Abstract

EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in a cinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

2606.18480 2026-06-18 eess.AS cs.SD Cross-list

Generalised Transcoding Framework for Arbitrary Spatial Audio Capture and Playback Formats

Archontis Politis, Janani Fernandez, Leo McCormack

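Affiliations: Faculty of Information Technology and Communication Sciences, Tampere University; Department of Information and Communications Engineering, Aalto University

AI summary: Proposes a unified framework that estimates time-frequency-domain spatial metadata (including primary components and an ambience component with its own angular power distribution) to transcode Ambisonic or raw microphone-array signals to arbitrary target playback formats, with support for independent rotations; experiments show it outperforms existing parametric renderers.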

Comments This work has been submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication

Abstract

This article introduces a unified framework for the parametric analysis and reproduction of spatial sound scenes captured either as Ambisonic signals or as raw microphone array signals. The proposed method estimates time-frequency-dependent spatial metadata that characterises a variable number of primary source components and an ambience component with its own angular power distribution, whose parameters fit the observed spatial covariances of the captured signals. This metadata is used to construct spatial covariances of the target playback formats, which are then used to derive optimal mixing matrices for transcoding the scene for playback over the target reproduction system. The method additionally handles independent rotations of both capture and playback setups. Real-time implementations of the method and other existing state-of-the-art parametric renderers are compared in a listening test using simulated scenes from Ambisonic, spherical, and head-worn arrays. The results highlight perceptual benefits of the proposed framework across a diverse range of content and receiver configurations, particularly for lower-order and geometrically constrained microphone arrays.

2606.18571 2026-06-18 cs.LG cs.CL cs.SD eess.AS Cross-list

Fair Cognitive Impairment Detection Through Unlearning

William Nguyen, Jiali Cheng, Hadi Amiri

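Affiliations: University of Massachusetts Lowell, USA

AI summary: Proposes a multimodal framework that combines cross-modal fusion with gradient-reversal unlearning to reduce the bias that demographic information introduces into mild cognitive impairment detection, narrowing performance gaps on cross-lingual datasets.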

Comments Interspeech 2026

Abstract

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.

2606.18979 2026-06-18 eess.AS cs.CL cs.SD Cross-list

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

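AI summary: Reduces scoring errors in speech-based assessment by fusing transcript-derived scores with Whisper embeddings, and uses the fused representations to approximate expert overall ratings, compensating for the missing motor subtests and effectively discriminating between cognitive status groups.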

Comments Accepted at INTERSPEECH 2026

Abstract

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

2606.19039 2026-06-18 cs.NE cs.LG cs.SD Cross-list

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

Taharim Rahman Anon, Jakaria Islam Emon

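Affiliations: PI LLC

AI summary: Proposes a learnable residual speech-to-spike encoder trained jointly with an R-LIF backbone, reaching 94.97% accuracy on GSC-v2; the encoder is parameter-efficient and learns task-aligned spike representations.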

Comments Accepted at Interspeech 2026. This version is a preprint

Abstract

The mismatch between continuous acoustic signals and discrete event-driven processing remains a fundamental bottleneck for neuromorphic speech processing. Current systems typically rely on fixed spike encoders, forcing downstream Spiking Neural Networks (SNNs) to compensate for non-adaptive input representations. To address this, we present a learnable residual speech-to-spike encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. We validate this approach on the Google Speech Commands v2 (GSC-v2) benchmark, achieving up to 94.97% accuracy. Notably, the learned encoder remains highly parameter-efficient with a compact 35k-parameter variant that reaches 89.8%, matching or exceeding prior baselines that require an order of magnitude more parameters. Our encoder-focused analysis, including linear probing and gradient-residual inspection, indicates that the encoder does not target faithful signal reconstruction but instead learns task-aligned spike representations that enhance class separability. Finally, we benchmark bio-inspired, hardware-friendly credit assignment by comparing Direct Feedback Alignment (DFA) with surrogate-gradient BPTT under identical architectures and training conditions. We find that DFA reaches 91.5% accuracy, quantifying the performance trade-off of bio-inspired learning rules for modern neuromorphic audio.
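For readers unfamiliar with spike encoding, the sketch below shows a fixed (non-learned) leaky integrate-and-fire encoder, the kind of non-adaptive front end the abstract contrasts with its learnable residual encoder. It is not the paper's encoder: the function name `lif_encode`, the leak factor `tau`, the threshold, and the hard-reset rule are all assumptions chosen for illustration.

```python
def lif_encode(signal, tau=0.9, threshold=1.0):
    """Turn a sequence of non-negative input values into a 0/1 spike train.

    The membrane potential leaks by a factor `tau` each step, integrates the
    input, and emits a spike (then resets to zero) when it reaches `threshold`.
    """
    v = 0.0
    spikes = []
    for x in signal:
        v = tau * v + x          # leaky integration
        if v >= threshold:
            spikes.append(1)
            v = 0.0              # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


# Hypothetical inputs: a louder constant frame level fires more often.
loud = lif_encode([0.5] * 20)
quiet = lif_encode([0.2] * 20)
short = lif_encode([0.6, 0.6, 0.0, 0.0])
```

Because `tau` and `threshold` are fixed here, any mismatch between the spike train and the task has to be absorbed by the downstream network; making such parameters trainable end to end is the general direction the abstract describes.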

2604.18109 2026-06-18 cs.CL cs.SD Updated version

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

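Affiliations: Brno University of Technology

AI summary: Proposes the Factorized Linear Projection (FLiP) model, which recovers lexical content from multilingual, multimodal sentence embeddings and reveals the modality and language biases of the encoders.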

Comments Accepted to Interspeech 2026

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS Updated version

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

Affiliations: Department of Biomedicine and Prevention, University of Rome Tor Vergata; A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)

Comments The first two authors contributed equally to this work

DOI: 10.1016/j.neunet.2026.109195
Journal ref: Neural Networks, 203, 109195 (2026)

Abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
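The retrieval step described in the abstract (a linear map from brain activity to audio-embedding space, followed by a nearest-neighbour lookup) can be illustrated with a toy example. Everything concrete here is hypothetical: the 2-to-3 dimensional map `W`, the candidate embeddings, and the clip names are made up, and a real pipeline would fit `W` by regression on fMRI data and compare against actual CLAP features.

```python
import math


def matvec(W, x):
    """Apply a linear map given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def retrieve(brain, W, candidates):
    """Project brain activity into embedding space and return the closest clip."""
    pred = matvec(W, brain)
    return max(candidates, key=lambda name: cosine(pred, candidates[name]))


# Hypothetical linear map and stimulus embeddings.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = {
    "clip_a": [1.0, 0.0, 1.0],
    "clip_b": [0.0, 1.0, 1.0],
    "clip_c": [-1.0, 0.0, 0.0],
}

best = retrieve([0.9, 0.1], W, candidates)
```

Identification accuracy in such a setup would be the fraction of held-out brain responses whose top-ranked candidate is the stimulus that was actually played.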

2206.05018 2026-06-18 cs.SD cs.CL eess.AS Updated version

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer, Sebastian P. Bayerl

Comments Accepted at the 25th International Conference on Text, Speech and Dialogue (TSD 2022)

1. Speech synthesis and sound generation (6 papers)

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

2. Speaker recognition, verification, and separation (2 papers)

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Do speech foundation models perceive speaker similarity as humans do?

3. Speech enhancement, denoising, and audio restoration (2 papers)

Reference-Based Recursive Least-Squares Mitigation of Real Interference in Stereo Audio Recordings

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

4. Audio event detection and scene understanding (2 papers)

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

5. Music information retrieval and music generation (2 papers)

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

6. Speech translation and speech language models (3 papers)

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Continuous Audio Thinking for Large Audio Language Models

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

7. Multimodal audio and audio-visual learning (2 papers)

Native Active Perception as Reasoning for Omni-Modal Understanding

Can We Hear from Events? Generating Speech from Event Camera

8. Low-resource, multilingual, and dialectal speech (2 papers)

Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings

Continual Adaptation for Pacific Indigenous Speech Recognition

9. Datasets, benchmarks, and evaluation (1 paper)

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

10. Security, privacy, and deepfake audio (3 papers)

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

11. Other / general speech and audio (9 papers)

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

Generalised Transcoding Framework for Arbitrary Spatial Audio Capture and Playback Formats

Fair Cognitive Impairment Detection Through Unlearning

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features