arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)

1. Speech Synthesis and Sound Generation (2 papers)

2509.09631 2026-06-18 cs.SD cs.CL cs.CV Updated version

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

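AI summary: Proposes the DiFlow-TTS framework, which uses discrete flow matching and a factorized discrete flow denoiser to balance high quality and low latency in zero-shot TTS.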

Comments Accepted at Interspeech 2026 (Long Paper Track)

Abstract

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

2506.12311 2026-06-18 cs.CL cs.SD eess.AS Updated version

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

Affiliations: Independent Researcher; Reichman University; Tel Aviv University; Carnegie Mellon University

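AI summary: Proposes the Phonikud framework, comprising an open-source G2P system, a corpus, a benchmark, and evaluation models, to address underspecified phonetic features such as stress in Hebrew TTS and to predict phonemes more accurately.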

Comments Accepted to Interspeech 2026. Project page: https://phonikud.github.io

Abstract

Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically accurate Hebrew TTS with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified International Phonetic Alphabet (IPA) transcriptions, designed by augmenting a base diacritizer. (2) The ILSpeech corpus of paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models capturing previously disregarded phonetic details for automatic TTS evaluation. Our results show that Phonikud more accurately predicts Hebrew phonemes than prior methods, and that small, local TTS models with phonetic input from Phonikud approach large proprietary systems. We release our code, data, and models at https://phonikud.github.io.

2. Speaker Recognition, Verification, and Diarization (2 papers)

2603.10827 2026-06-18 cs.SD cs.AI Updated version

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

Affiliations: Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

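AI summary: Proposes a model-agnostic scoring protocol to evaluate the speaker discrimination of speech-aware LLMs (EER above 20%), then injects frozen ECAPA-TDNN speaker embeddings and fine-tunes LoRA adapters to approach the performance of a dedicated system (1.03% EER).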

Comments 3 Tables, 1 Figure, Published in Interspeech 2026

Abstract

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
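
The scoring idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the epsilon clamp, and the simple threshold sweep used to estimate the EER are assumptions made here for illustration only.

```python
import math


def llr_score(p_yes: float, p_no: float, eps: float = 1e-12) -> float:
    """Continuous verification score: log(p_yes / p_no).

    p_yes and p_no are the probabilities a model assigns to the "Yes" and
    "No" tokens when asked whether two utterances share a speaker.
    The eps clamp (an assumption) avoids log(0).
    """
    return math.log(max(p_yes, eps)) - math.log(max(p_no, eps))


def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a threshold over the observed scores.

    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    Returns the mean of false-accept and false-reject rates at the threshold
    where the two rates are closest.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(s >= t for s in nontargets) / len(nontargets)  # false accepts
        frr = sum(s < t for s in targets) / len(targets)         # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Hypothetical trials: (p_yes, p_no, label)
    trials = [(0.9, 0.1, 1), (0.7, 0.3, 1), (0.2, 0.8, 0), (0.4, 0.6, 0)]
    scores = [llr_score(py, pn) for py, pn, _ in trials]
    labels = [y for _, _, y in trials]
    print(f"EER = {equal_error_rate(scores, labels):.2%}")
```

For API-only models that expose no token probabilities, the abstract mentions confidence scores as an alternative; such a score could be passed to `equal_error_rate` in place of the log-likelihood ratio.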

2606.05739 2026-06-18 cs.SD eess.AS Updated version

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

Affiliations: Keio University, Japan; The University of Tokyo, Japan

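AI summary: Compares speaker embeddings from more than 40 speech foundation models with human subjective similarity ratings to test whether model distances align with human perception, and identifies the configuration factors that most affect that alignment.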

Comments Accepted by INTERSPEECH 2026. Camera-ready version

Abstract

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
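
One way to picture the comparison described in this abstract is the sketch below. The abstract does not state which distance or correlation measure the authors use, so cosine distance and Spearman rank correlation are assumptions chosen here for illustration, as are all function names and the toy data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical speaker-embedding pairs and human similarity ratings.
    pairs = [([1.0, 0.1], [0.9, 0.2]), ([1.0, 0.0], [0.5, 0.8]), ([1.0, 0.0], [0.0, 1.0])]
    human_similarity = [4.6, 2.9, 1.2]
    model_similarity = [-cosine_distance(u, v) for u, v in pairs]
    print(f"Spearman rho = {spearman(model_similarity, human_similarity):.3f}")
```

A rank correlation is used in this sketch because human ratings and embedding distances live on different scales; only their ordering is compared.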

3. Music Information Retrieval and Music Generation (1 paper)

2606.15088 2026-06-18 cs.SD cs.CL eess.AS Updated version

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

Affiliations: Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University

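AI summary: Proposes the Paired Pathway Controlled Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more readily than knowledge acquired through the audio pathway; the effect is not explained by architectural depth and stems mainly from differences in input representation.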

Abstract

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

4. Speech Translation and Speech Language Models (1 paper)

2508.07375 2026-06-18 cs.CL cs.SD eess.AS Updated version

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

Affiliations: The Chinese University of Hong Kong; Huawei Technologies

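AI summary: Proposes TurnGuide, which dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This addresses the time-alignment problem that arises when full-duplex speech language models integrate discrete text tokens into continuous double-channel audio, and it significantly improves semantic coherence and turn-taking performance.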

Comments Interspeech 2026 Long Paper Track

Abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

5. Multimodal Audio and Audio-Visual Learning (1 paper)

2605.26672 2026-06-18 cs.MM cs.SD Updated version

Can We Hear from Events? Generating Speech from Event Camera

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

Affiliations: Beijing Technology and Business University; Xidian University; Tongji University; University of Sydney

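AI summary: Proposes the EventSpeech framework, which exploits the high temporal precision of neuromorphic event cameras to address the temporal granularity mismatch in conventional RGB-based speech generation, producing expressive speech that preserves emotion and resists motion blur.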

Abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

6. Low-Resource, Multilingual, and Dialectal Speech (1 paper)

2603.06310 2026-06-18 eess.AS cs.CL cs.SD Updated version

Continual Adaptation for Pacific Indigenous Speech Recognition

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang

Affiliations: The University of Melbourne; UNSW Sydney

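AI summary: Studies adaptation strategies for speech foundation models under the data scarcity and catastrophic forgetting that affect Pacific Indigenous languages; finds that LoRA forgets catastrophically during sequential learning and that tailored, robust adaptation methods are needed.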

Comments Accepted by Interspeech 2026

Abstract

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate the impact of data volume, adaptation strategies, and representational drift on speech foundation models for various Pacific languages. Additionally, we analyze a continual learning framework for sequential language acquisition. Empirical results across three distinct Pacific Indigenous languages demonstrate that adapting to these linguistically distant languages induces severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.

7. Datasets, Benchmarks, and Evaluation (1 paper)

2603.05128 2026-06-18 eess.AS cs.SD Updated version

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

Affiliations: Harbin University of Science and Technology; The University of Melbourne; KAIST; University of Surrey

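AI summary: Addresses the lack of evaluation for compositional reasoning in polyphonic audio with PolyBench, a benchmark of five subsets (counting, classification, detection, concurrency, and duration estimation); evaluation shows that current large audio language models degrade consistently in polyphonic settings.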

Comments Accepted by INTERSPEECH 2026

Abstract

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio, yet existing benchmarks offer limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. To address this gap, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio, comprising five evaluation subsets that cover counting, classification, detection, concurrency, and duration estimation, all of which require reasoning over multiple concurrent events and their relations. Our evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic settings, indicating a fundamental bottleneck in current LALMs.

8. Security, Privacy, and Deepfake Audio (2 papers)

2603.04865 2026-06-18 cs.SD Updated version

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Affiliations: School of Electrical Engineering, KAIST, Daejeon, Republic of Korea; University of Melbourne, Australia; Fortemedia Singapore, Singapore; Xi'an University of Posts & Telecommunications, Xi'an, China; Xi'an Lianfeng Acoustic Technologies Co., Ltd., China

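AI summary: Introduces the first Environmental Sound Deepfake Detection (ESDD) Challenge, covering robustness evaluation, system architectures, and future research directions, and outlines the key challenges and opportunities for ESDD.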

Comments Accepted by Interspeech 2026

Abstract

Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.

2602.04796 2026-06-18 eess.AS cs.SD Updated version

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

Amir Ivry, Shinji Watanabe

Affiliations: Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel; Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

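AI summary: Notes that evaluation of socially unsafe content in spoken dialogue remains text-centric, missing prosody and transcription failures, and presents an open benchmark of 24,000 multi-turn spoken dialogues. Evaluates 6 large audio-language models in text-only, audio-only, and multimodal setups for sensitivity, severity-order specificity, and turn-position bias, finding that audio contributes non-lexical evidence and that multimodal gains are not universal but follow several distinct patterns.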

Comments Accepted to ICML 2026

Abstract

Evaluation of socially unsafe content in spoken dialogues remains text-centric, missing prosody and transcription failures. We present LALM-as-a-Judge, which includes an open benchmark of 24,000 multi-turn spoken dialogues with one localized unsafe turn, generated from 8 socially unsafe categories and 5 severity levels. We evaluate 6 large audio-language models (LALMs) as judges, both open- and closed-source, in text-only, audio-only, and multimodal setups by their sensitivity, severity-order specificity, and turn-position bias for socially harmful content in the dialogue. Results show that audio contributes non-lexical evidence beyond transcript semantics and that multimodal gains are not universal but can be text-anchored, balanced, conservative, or interfering, which we link to audio pathway bottlenecks and fusion limits. We position the benchmark as diagnostic and derive practitioner guidance for model, modality, and prompt choices.

9. Other / General Speech and Audio (3 papers)

2604.18109 2026-06-18 cs.CL cs.SD Updated version

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

Affiliations: Brno University of Technology

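AI summary: Proposes factorized linear projection (FLiP) models that recover lexical content from multilingual and multimodal sentence embeddings, revealing the modality and language biases of the encoders.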

Comments Accepted to Interspeech 2026

Abstract

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is publicly available at https://github.com/BUTSpeechFIT/FLiP.

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS Updated version

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

Affiliations: Department of Biomedicine and Prevention, University of Rome Tor Vergata; A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)

Comments The first two authors contributed equally to this work

Journal ref: Neural Networks, 203, 109195 (2026)

Abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
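
The retrieval step described in this abstract (a linear map from brain activity to CLAP features, followed by a search for the most similar stimulus) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the map `W` is taken as already fitted, cosine similarity is assumed as the similarity measure, and all names and toy values are hypothetical.

```python
import math


def project(W, x):
    """Apply a linear map W (list of rows) to a brain-activity vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(predicted, candidates):
    """Index of the candidate feature vector most similar to the prediction."""
    return max(
        range(len(candidates)),
        key=lambda i: cosine_similarity(predicted, candidates[i]),
    )


if __name__ == "__main__":
    # Toy map from 3 voxels to a 2-dimensional feature space (hypothetical).
    W = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
    brain_activity = [0.9, 0.1, 5.0]
    # Hypothetical audio features for three candidate stimuli.
    stimulus_features = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
    predicted = project(W, brain_activity)
    print("retrieved stimulus index:", retrieve(predicted, stimulus_features))
```

In practice the map would be estimated from paired fMRI responses and stimulus features; the abstract specifies only that it is linear, so the fitting procedure is left out of this sketch.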

2206.05018 2026-06-18 cs.SD cs.CL eess.AS Updated version

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer, Sebastian P. Bayerl

Comments Accepted at the 25th International Conference on Text, Speech and Dialogue (TSD 2022)

Journal ref: Proceedings of the 25th International Conference on Text, Speech, and Dialogue (TSD 2022)

Abstract

Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological tests, namely the German SKT and a German version of the CERAD-NB, and a semi-structured clinical interview between a patient and a psychologist. For the tests, we focus on speech recordings of three sub-tests: reading numbers (SKT 3), interference (SKT 7), and verbal fluency (CERAD-NB 1). We show that acoustic features from standardized tests can be used to reliably discriminate cognitively impaired individuals from non-impaired ones. Furthermore, we provide evidence that even features extracted from random speech samples of the interview can be a discriminator of cognitive impairment. In our baseline experiments, we use OpenSMILE features and Support Vector Machine classifiers. In an improved setup, we show that using wav2vec 2.0 features instead, we can achieve an accuracy of up to 85%.