arXivDaily: arXiv daily academic digest, updated Monday to Friday
eess.AS Audio and Speech (24)
2606.12328 2026-06-11 eess.AS New submission

HALO: Half-Frame-Rate Adaptive Learnable Operator for Lightweight STFT-Based Speech Enhancement

Jiadong Zhao, Dahan Wang, Yu Sun, Leyan Yang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Jing Lu

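AI summary: Proposes the HALO module, which processes at half the frame rate to reduce the redundancy of overlapping STFT frames and cut the compute cost of lightweight models; its effectiveness is validated on the DNS3 dataset.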

Comments
Accepted by Interspeech 2026
Abstract

STFT-based speech enhancement typically adopts overlapping analysis frames. While overlap is essential for stable STFT processing, it makes adjacent frames highly correlated, causing redundant computation in lightweight models. We propose Half-frame-rate Adaptive Learnable Operator (HALO), a causal plug-in module that halves the internal frame rate without altering the STFT procedure. Broadly applicable to many lightweight models, HALO applies adaptive rate reduction before the backbone and restoration afterward, reconstructing the full-rate spectrum on the original STFT grid. Both reduction and restoration are implemented with lightweight dynamic convolutions. By halving the processed frame rate, HALO reduces backbone compute cost with no added algorithmic latency, freeing budget for channel widening. Experiments on the DNS3 dataset show consistent gains across diverse lightweight models under matched complexity, demonstrating the effectiveness of reducing overlap-induced redundancy.
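
To make the frame-rate halving concrete, here is a minimal sketch of the shape bookkeeping only. HALO implements reduction and restoration with learned dynamic convolutions; in this sketch, fixed pairwise averaging and frame repetition stand in for them, and the function names `reduce_half` and `restore_full` are hypothetical.

```python
import numpy as np

def reduce_half(spec: np.ndarray) -> np.ndarray:
    """Merge adjacent frame pairs: (T, F) -> (ceil(T / 2), F)."""
    if spec.shape[0] % 2:
        # Odd length: repeat the last frame so every frame has a partner.
        spec = np.concatenate([spec, spec[-1:]], axis=0)
    # Fixed averaging is a stand-in for HALO's learned dynamic convolution.
    return 0.5 * (spec[0::2] + spec[1::2])

def restore_full(half: np.ndarray, n_frames: int) -> np.ndarray:
    """Return to the original frame grid: (ceil(T / 2), F) -> (T, F)."""
    # Frame repetition is a stand-in for the learned restoration.
    return np.repeat(half, 2, axis=0)[:n_frames]

spec = np.random.default_rng(0).standard_normal((101, 257))  # (frames, bins), toy data
half = reduce_half(spec)                  # a backbone would run on this grid
full = restore_full(half, spec.shape[0])  # back on the original STFT grid
print(spec.shape, half.shape, full.shape)
```

The backbone sees half as many frames, which is where the compute saving described in the abstract would come from; this sketch does not model causality or latency.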

2606.12199 2026-06-11 eess.AS cs.CL cs.SD New submission

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, Zhengxi Liu, Hongzhan Lin, Zheqi Dai, Xinshen Zhang, Peiwen Sun, Qiuqiang Kong, Wei Xue

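AI summary: Studies the temporal-granularity mismatch behind the speech-text modality gap, proposes factorized FSQ and a lightweight non-autoregressive audio LM head to make low frame rates feasible, and finds that a 4.17 Hz frame rate with intermediate-layer representation alignment performs best on speech QA.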

Comments
Accepted by Interspeech 2026 long paper
Abstract

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
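
The link between frame rate and per-frame capacity under a fixed information rate is simple arithmetic, sketched below. The 600 bit/s budget is an assumed value, chosen only so that the lowest frame rate lands near the "nearly 300 bits/frame" figure; the abstract does not state the bitrate. Reading 4.17 Hz and 2.08 Hz as 50/12 and 50/24 is likewise an assumption.

```python
def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """At a fixed bitrate, each frame must carry bitrate / frame_rate bits."""
    return bitrate_bps / frame_rate_hz

BITRATE = 600.0  # bits per second; assumed for illustration, not from the paper

for divisor in (1, 12, 24):
    rate = 50.0 / divisor  # 50 Hz, about 4.17 Hz, about 2.08 Hz
    print(f"{rate:6.2f} Hz -> {bits_per_frame(BITRATE, rate):6.1f} bits/frame")
```

Lowering the frame rate by a factor of 24 raises the required capacity per frame by the same factor, which is why the abstract pairs low frame rates with a higher-capacity quantizer.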

2606.11836 2026-06-11 cs.SD cs.AI eess.AS New submission

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

Affiliations: The Chinese University of Hong Kong; National Research Council Canada

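AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering, with fine-grained mixed-sparsity pruning via a per-layer number of parameter clusters; it substantially reduces WER relative to magnitude-based pruning on HuBERT-large and Whisper-large-v3.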

Comments
Accepted by Interspeech 2026
Abstract

This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. Finer-grained mixed-sparsity pruning, obtained by varying the number of parameter clusters per layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with a pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning, and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
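
As a rough illustration of channelwise clustering, the sketch below runs plain Lloyd's k-means over the rows of a toy weight matrix and replaces each row with its cluster centroid. It needs no data and no training, but it is a generic sketch rather than the paper's procedure: treating rows as channels, the helper names, and the matrix sizes are all assumptions, and the abstract does not say how cluster counts map to the reported sparsity levels.

```python
import numpy as np

def nearest_centroid(w, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of w."""
    dists = ((w[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def cluster_channels(w, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means over the rows (assumed output channels) of w."""
    rng = np.random.default_rng(seed)
    centroids = w[rng.choice(w.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_centroid(w, centroids)
        for j in range(k):
            members = w[labels == j]
            if len(members):  # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, nearest_centroid(w, centroids)

w = np.random.default_rng(1).standard_normal((64, 32))  # toy layer weights
centroids, labels = cluster_channels(w, k=32)            # 64 channels -> 32 clusters
w_compressed = centroids[labels]                         # each row replaced by its centroid
```

Only the `k` centroids and the per-channel labels need to be stored, so the number of distinct channels per layer is at most `k`.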

2606.11795 2026-06-11 eess.AS cs.SD New submission

Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency

Shota Horiguchi, Marc Delcroix, Naohiro Tawara, Takanori Ashihara, Atsushi Ando

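AI summary: Addresses the loose predicted boundaries that result from training on loose annotations by generating tight pseudo labels with causal and anticausal models and refining them iteratively through co-training; recovers about 70% of the tightening effect of ideal tight-label training and improves downstream performance.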

Comments
Accepted to Interspeech 2026 (Long Paper Track)
Abstract

Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
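
One plausible reading of why the two models help: a causal model cannot anticipate an onset it has not yet heard, and an anticausal model cannot extend activity past an offset whose following context contains no speech. The sketch below combines toy frame-level outputs with a logical AND. The AND rule and the example sequences are assumptions for illustration; the abstract does not state how the two outputs are combined.

```python
import numpy as np

def tight_pseudo_label(causal_active, anticausal_active):
    """Keep only frames that both models mark as speech (assumed combination rule)."""
    return np.logical_and(causal_active, anticausal_active)

# Toy example: true speech on frames 4..7, loose annotation on frames 2..9.
causal = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)      # tight onset, loose offset
anticausal = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)  # loose onset, tight offset

tight = tight_pseudo_label(causal, anticausal)
print(np.flatnonzero(tight))  # frames kept as speech
```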

2606.11766 2026-06-11 eess.AS cs.AI cs.CL cs.SD New submission

Fast Speech Foundation Model Distillation Using Interleaved Stacking

Eungbeom Kim, Kyogu Lee

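AI summary: Proposes interleaved stacking to accelerate distillation training of speech foundation models; it avoids the performance degradation of existing stacking methods by preserving layer positions, and is validated on SUPERB.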

Comments
Accepted by Interspeech 2026
Abstract

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2606.11631 2026-06-11 eess.AS cs.SD New submission

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

Jun Xu, Zhengxue Cheng, Fengxi Zhang, Yuhan Liu, Li Song, Wenjun Zhang

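AI summary: Proposes ECC, an entropy-constrained codec that combines scalar quantization with a learned entropy model and achieves a better low-bitrate rate-distortion trade-off than conventional and neural codec baselines.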

Abstract

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: this https URL
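
The phrase "estimate latent likelihoods for rate estimation" has a compact generic form: the ideal code length of a symbol with modeled probability p is -log2(p) bits, and training minimizes rate plus a weighted distortion. The sketch below shows that generic objective only. The mean-squared-error distortion, the weight `lam`, and the function names are assumptions; ECC's actual entropy model and distortion terms are not given in the abstract.

```python
import numpy as np

def estimated_bits(likelihoods: np.ndarray) -> float:
    """Ideal code length in bits for symbols with the given model probabilities."""
    return float(-np.log2(likelihoods).sum())

def rd_loss(likelihoods, x, x_hat, lam):
    """Generic rate-distortion objective: rate + lam * distortion (MSE assumed)."""
    rate = estimated_bits(likelihoods)
    distortion = float(np.mean((x - x_hat) ** 2))
    return rate + lam * distortion

probs = np.array([0.5, 0.25, 0.25])  # toy likelihoods of three quantized latents
print(estimated_bits(probs))
print(rd_loss(probs, np.zeros(4), np.ones(4), lam=2.0))
```

Symbols the model predicts confidently cost few bits, which is the intuition behind skipping highly predictable residual symbols altogether.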

2606.11581 2026-06-11 eess.AS cs.SD New submission

Sensitivity Analysis of Generative Spatial Audio Metrics: A Study on Responsiveness, Smoothness, and Symmetry

Purnima Kamath, Adrian S. Roman, Koichi Saito, Yuki Mitsufuji, Juan P. Bello

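AI summary: Proposes a framework for analyzing how sensitive generative spatial audio metrics are to changes in spatial parameters, defines three desiderata (Responsiveness, Smoothness, and Symmetry), and finds that FAD with localization-specific embeddings and acoustic maps perform best among the standard metrics evaluated.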

Comments
Accepted for publication at Interspeech 2026
Abstract

Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.

2606.11429 2026-06-11 eess.AS cs.CL cs.SD New submission

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

Zilai Wang, Natarajan Balaji Shankar, Mohan Shi, Kaiyuan Zhang, Abeer Alwan

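AI summary: Proposes Gumbel-BEARD, a framework that automatically selects Whisper encoder layers with a trainable Gumbel-Softmax selector and uses a BEST-RQ self-supervised objective for low-resource domain adaptation; it achieves state-of-the-art word error rates on child speech datasets and up to 6% relative WER reduction on dialectal speech.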

Comments
Accepted by Interspeech 2026
Abstract

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
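
For readers unfamiliar with hard Gumbel-Softmax selection, the sketch below shows the forward pass in NumPy: Gumbel noise is added to per-layer scores, and the arg-max becomes a one-hot choice. It is forward-only, so the straight-through gradient that makes the selector trainable is not shown. The scores, tensor shapes, and function name are hypothetical, and how Gumbel-BEARD wires the selector into Whisper is not described in the abstract.

```python
import numpy as np

def hard_gumbel_softmax(logits, tau, rng):
    """Sample a one-hot choice (hard) and its relaxed probabilities (soft)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())                      # numerically stable softmax
    soft = y / y.sum()
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0
    return hard, soft

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, 0.3, -1.0])          # one score per candidate layer (toy values)
layer_outputs = rng.standard_normal((4, 50, 8))   # (layers, frames, dims), toy values
hard, soft = hard_gumbel_softmax(logits, tau=1.0, rng=rng)
selected = np.tensordot(hard, layer_outputs, axes=1)  # output of the chosen layer
```

In PyTorch, `torch.nn.functional.gumbel_softmax(logits, tau=..., hard=True)` provides the same sampling with a straight-through gradient.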

2606.11400 2026-06-11 cs.SD cs.AI eess.AS New submission

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Tsung-En Lin, Hung-Yi Lee

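AI summary: Proposes instruction-based vector steering, which contrasts activations under different instructions to redirect temporal attention over audio tokens, enabling training-free sound event localization that far outperforms direct prompting and random baselines.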

Abstract

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
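
The two steps described above, building a steering vector from contrasting prompts and reading out where attention changes most, can be sketched on toy arrays. The mean-difference construction is the common activation-steering recipe and is assumed here; the function names, the scale `alpha`, and all values are hypothetical, and no real model is involved.

```python
import numpy as np

def steering_vector(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Mean activation difference between two instruction sets, shape (d,)."""
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every position of a hidden state."""
    return hidden + alpha * vector

def locate_event(attn_base: np.ndarray, attn_steered: np.ndarray) -> int:
    """Audio-token index with the largest steering-induced attention increase."""
    return int(np.argmax(attn_steered - attn_base))

acts_a = np.array([[1.0, 2.0], [3.0, 4.0]])   # activations under instruction A (same audio)
acts_b = np.array([[0.0, 0.0], [2.0, 2.0]])   # activations under instruction B (same audio)
v = steering_vector(acts_a, acts_b)
steered = steer(np.zeros((3, 2)), v, alpha=2.0)

attn_base = np.array([0.25, 0.25, 0.25, 0.25])
attn_steered = np.array([0.10, 0.10, 0.70, 0.10])
print(v, locate_event(attn_base, attn_steered))
```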

2606.11386 2026-06-11 cs.CL cs.AI eess.AS New submission

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

Affiliation: MIT CSAIL (MIT Computer Science and Artificial Intelligence Laboratory)

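AI summary: Targets the delayed internal transition of full-duplex spoken language models during user interruptions and proposes activation steering with a perception vector, which substantially improves interruption comprehension without fine-tuning.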

Abstract

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.

2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP New submission

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

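AI summary: Proposes a semantic-timescale analysis pipeline that uses autocorrelation-window measures (ACW-0) to quantify the temporal organization of semantic specificity and contextual similarity in human and AI-generated speech; longer ACW-0 is associated with more generic vocabulary, and the association weakens when word order and timing are shuffled.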

Comments
45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language
Abstract

Spoken language, whether produced by humans or large language models (LLM), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
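
ACW-0 is commonly defined as the first lag at which a series' autocorrelation function reaches zero. The sketch below follows that common definition and returns the lag in samples; the paper's exact estimator, and how word-level series are resampled in time, are not given in the abstract, so treat this as an assumption.

```python
import numpy as np

def acw0(x) -> int:
    """First lag (in samples) where the autocorrelation is zero or negative."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    acf = acf / acf[0]                                  # normalize so acf[0] == 1
    below = np.flatnonzero(acf <= 0)
    return int(below[0]) if below.size else len(x)

slow = [1, 1, 1, -1, -1, -1]   # changes slowly: longer autocorrelation window
fast = [1, -1, 1, -1]          # alternates every sample: shortest possible window
print(acw0(slow), acw0(fast))
```

A longer ACW-0 means the semantic series stays correlated with its own past for longer, which the abstract associates with more generic vocabulary.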

2606.11279 2026-06-11 eess.AS cs.CL cs.LG cs.SD New submission

Massive Open-Vocabulary Keyword Spotting

Leonor Barreiros, Raul Monteiro, Afonso Mendes, Gonçalo M. Correia

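AI summary: Proposes an open-vocabulary keyword spotting system whose stored features take up to 128 times less memory than a comparable baseline; it handles massive databases without fine-tuning the speech recognition model and matches the entity recall of uncompressed solutions even in unseen languages.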

Comments
Accepted to Interspeech 2026
Abstract

Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.

2606.11197 2026-06-11 eess.AS cs.AI cs.CL cs.SD New submission

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

Xuzhi Wang, Xinran Wu, Ziping Zhao, Jianhua Tao, Björn W. Schuller

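AI summary: Proposes a memory-based feature augmentation method that selectively integrates historical temporal features and dynamic memory features and fuses them with GRU outputs through a Hierarchical Attention Fusion module, achieving state-of-the-art performance on DAIC-WOZ and E-DAIC.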

Comments
Accepted at IEEE TAC
Abstract

Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.

2606.06940 2026-06-11 eess.AS cs.SD Updated version

Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

Zhixian Zhao, Shuiyuan Wang, Wenjie Tian, Jingbin Hu, Ziyu Zhang, Lei Xie

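AI summary: Proposes the CogAudio-LLM framework: the LIME-440K dataset promotes acoustic-semantic decoupling, the EIPS chain-of-thought mechanism adds psychological reasoning, and the DR-SAPO optimization strategy balances logical rigor with empathetic quality, addressing semantic dominance and shallow affective cognition in audio language models.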

Comments
Accepted by Interspeech 2026
Abstract

While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (this https URL), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a "lexically-identical, multi-emotion" dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.

2606.06065 2026-06-11 cs.CL cs.SD eess.AS Updated version

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

Seung Hwan Cho, Young-Min Kim

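AI summary: For dual-output second-language speech recognition, finds that multi-task learning degrades surface transcription, attributes this to encoder-level representational entanglement, and shows that the degradation grows with surface-meaning divergence, especially in English.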

Comments
5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio
Abstract

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
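
Surface-meaning divergence is measured here with Levenshtein edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming implementation is sketched below; the transcript pair is a made-up example, and whether the paper computes the distance over characters, phonemes, or words is not stated in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))            # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

surface = "I sink so"    # hypothetical transcription of what was pronounced
meaning = "I think so"   # hypothetical transcription of what was intended
print(levenshtein(surface, meaning))
```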

2606.05394 2026-06-11 cs.SD eess.AS Updated version

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

Abhinaba Roy, Junyi Liang, Dorien Herremans

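AI summary: Addresses nnAudio's TorchScript compilation failures, inverse-transform edge cases, and dependency drift by removing dynamic state mutation, restricting the supported scope of inversion, and updating dependencies, restoring compatibility with modern PyTorch and SciPy and making differentiable audio analysis more robust.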

Abstract

nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
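
The statement that VQT reduces to CQT when gamma = 0 follows from the usual variable-Q bandwidth rule, in which bin k has bandwidth alpha * f_k + gamma, so its Q factor is f_k / (alpha * f_k + gamma). The sketch below checks this numerically. It uses the textbook formulation rather than nnAudio's code or API; the value of `alpha`, the frequency grid, and the function name are assumptions.

```python
import numpy as np

def q_factors(freqs: np.ndarray, alpha: float, gamma: float) -> np.ndarray:
    """Q factor per bin for bandwidth B_k = alpha * f_k + gamma."""
    return freqs / (alpha * freqs + gamma)

bins_per_octave = 12
freqs = 55.0 * 2.0 ** (np.arange(24) / bins_per_octave)  # two octaves from 55 Hz
alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0             # assumed bandwidth scale

q_cqt = q_factors(freqs, alpha, gamma=0.0)    # gamma = 0: same Q in every bin
q_vqt = q_factors(freqs, alpha, gamma=20.0)   # gamma > 0: lower Q at low frequencies
print(q_cqt[:3], q_vqt[:3])
```

With gamma > 0 the low-frequency filters become wider in frequency and shorter in time, which is the practical reason for using a VQT.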

2606.02220 2026-06-11 eess.AS Updated version

SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment

SooHwan Eom, Mark Hasegawa-Johnson, Chang D. Yoo

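AI summary: Proposes SiamCTC, which combines Siamese networks with a Connectionist Temporal Classification (CTC) loss to align different temporal realizations of the same content flexibly and monotonically, learning speech representations without strict frame-level correspondence and improving robustness to speaking-rate variation.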

Comments
Accepted to Interspeech 2025
Abstract

Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.
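
CTC tolerates speed perturbation because many frame-level paths of different lengths map to the same label sequence: repeated symbols are merged, then blanks are removed. The sketch below shows only this standard collapse rule on toy paths. How SiamCTC derives its targets from the second view is not described in the abstract, and the symbol values are arbitrary.

```python
def ctc_collapse(path, blank=0):
    """Standard CTC mapping: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

slow = [1, 1, 0, 2, 2, 2, 0, 3]   # slower rendition: more frames per unit
fast = [1, 2, 0, 3]               # faster rendition of the same content
print(ctc_collapse(slow), ctc_collapse(fast))
```

Both paths collapse to the same sequence in the same order, which is the monotonic, length-independent correspondence the abstract relies on.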

2603.11678 2026-06-11 eess.AS cs.SD Updated version

RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis

Yongjoon Lee, Jung-Woo Choi

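AI summary: Proposes Relativistic Adversarial Feedback (RAF), a training objective that uses self-supervised speech models and relativistic pairing to improve the in-domain fidelity and generalization of GAN vocoders, outperforming LSGAN-trained BigVGAN with 88% fewer parameters.

Details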
Comments
Accepted to Interspeech 2026 Long paper track. Code: this https URL
English abstract

We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
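
Relativistic pairing scores a real sample relative to a fake one instead of scoring each in isolation. The sketch below uses the generic relativistic standard GAN loss, -log sigmoid(C(real) - C(fake)), as an assumption. The abstract does not state RAF's exact loss, and the critic scores here are made-up numbers.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss: mean of -log sigmoid(C(real) - C(fake)) over pairs."""
    pairs = list(zip(real_scores, fake_scores))
    return sum(softplus(-(r - f)) for r, f in pairs) / len(pairs)


def relativistic_g_loss(real_scores, fake_scores):
    """Generator loss: mean of -log sigmoid(C(fake) - C(real)) over pairs."""
    pairs = list(zip(real_scores, fake_scores))
    return sum(softplus(-(f - r)) for r, f in pairs) / len(pairs)


# Hypothetical critic outputs for two real/fake waveform pairs.
real_scores = [2.0, 1.5]
fake_scores = [-1.0, 0.5]
d_loss = relativistic_d_loss(real_scores, fake_scores)
g_loss = relativistic_g_loss(real_scores, fake_scores)
```

Only the difference between the paired scores matters, so the discriminator is pushed to rank real above fake rather than to hit absolute targets.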

2509.15680 2026-06-11 cs.SD eess.AS Updated version

SAM: A Mamba-2 State-Space Audio-Language Model

Taehan Lee, Jaehan Jung, Hyukjun Lee

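AI summary: Proposes SAM, an audio-language model built on a Mamba-2 backbone that matches or surpasses 7B Transformer models on AudioSet and AudioCaps with fewer parameters, and systematically analyzes how SSMs interact with audio encoder outputs.

Details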
Comments
6 pages, Accepted to Interspeech 2026
English abstract

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.

2602.00560 2026-06-11 cs.SD eess.AS Updated version

Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

Yong Ren, Jiangyan Yi, Jianhua Tao, Tao Wang, Le Xu, Zhengqi Wen

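AI summary: Proposes a framework that edits content in a stable semantic space and preserves acoustic continuity with a Flow Matching decoder, using Self-Consistency Rewards Group Relative Policy Optimization to achieve imperceptible text-based speech editing.

Details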
Comments
Accepted by Interspeech 2026
English abstract

Imperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.
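
Group Relative Policy Optimization scores each sampled output against the other samples drawn for the same input. A minimal sketch of that group normalization follows, assuming the usual form (reward minus group mean, divided by group standard deviation). The reward values are invented, and the paper's self-consistency reward from the Text-to-Speech critic is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Uses the population standard deviation; eps guards against a zero
    spread when every sample receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate edits of the same utterance.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
best = max(range(len(advantages)), key=advantages.__getitem__)
```

Because the baseline is the group mean, no separate value network is needed: samples better than their siblings get positive advantages and the rest get negative ones.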

2510.23320 2026-06-11 eess.AS cs.CL cs.SD Updated version

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

Máté Gedeon, Péter Mihajlik

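AI summary: Proposes LibriConvo, a synthetic conversational speech corpus built on the Speaker-Aware Simulated Conversation framework for benchmarking speaker diarization and ASR. It contains 240.1 hours of audio; in baseline experiments, Sortformer outperforms pyannote in diarization and Fast Conformer-CTC outperforms Whisper in ASR.

Details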
Comments
Accepted by TSD 2026
English abstract

We introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1% vs. 24.4% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29% WER and 6.97% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.
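
The WER figures above rest on a word-level edit distance. A minimal sketch of that computation follows, assuming whitespace tokenization and a non-empty reference. It does not cover cpWER, which additionally searches over speaker permutations, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer_sub = word_error_rate("the cat sat", "the bat sat")
wer_del = word_error_rate("a b c d", "a c d")
```

Since insertions count as errors but the denominator is the reference length, WER can exceed 100% for very noisy hypotheses.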

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS Updated version

Benchmarking Cross-Domain Audio-Visual Deception Detection

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

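AI summary: Proposes the first cross-domain audio-visual deception detection benchmark to evaluate generalization across scenarios, and designs the MM-IDGM algorithm and the Attention-Mixer fusion method to improve performance.

Details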
Comments
17 pages
English abstract

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize to real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further examine the impact of using data from multiple source domains for training, we investigate three domain sampling strategies (domain-simultaneous, domain-alternating, and domain-by-domain) for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

2406.07909 2026-06-11 eess.AS cs.CL cs.SD stat.ML

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

Eungbeom Kim, Hantae Kim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2024
English abstract

The Transformer encoder with a connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR suffers from disagreement between teacher and student models in frame-level alignment, which ultimately hinders it from improving the student model's performance. To resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during training. In contrast to the conventional method using separate teacher and student models, this study introduces a simple and effective method that shares encoder layers and uses the sub-model as the student model. Overall, our approach is effective in improving both resource efficiency and performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.
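
The alignment disagreement described above can be made concrete with a frame-level KL divergence between teacher and student posteriors. The sketch below is an illustration under assumptions: the abstract does not state that the paper uses this exact loss, and the two-class posteriors (index 0 as blank, index 1 as a label) are invented.

```python
import math


def frame_kl(teacher, student, eps=1e-12):
    """Mean over frames of KL(teacher_t || student_t)."""
    total = 0.0
    for p, q in zip(teacher, student):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total / len(teacher)


# Teacher emits the label spike at frame 1.
teacher = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
# One student spikes at the same frame, the other one frame later.
aligned_student = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
shifted_student = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]

kl_aligned = frame_kl(teacher, aligned_student)
kl_shifted = frame_kl(teacher, shifted_student)
```

Both students would decode to the same label sequence, yet the shifted one is heavily penalized frame by frame. That mismatch is the kind of disagreement that guiding the alignment is meant to reduce.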

2305.13108 2026-06-11 eess.AS cs.CL cs.LG cs.SD

Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test

Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2023
English abstract

Automatic speech recognition systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM optimizes the average performance over data samples regardless of group, such as healthy or dysarthric speakers, ASR systems are unaware of performance disparities across groups. This results in biased ASR systems whose performance differences among groups are severe. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve our goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of a given data sample and then mitigates the bias by reweighting samples according to that helpfulness. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.
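
A small numeric sketch shows why an ERM average can hide a group disparity and how sample weights shift the objective. The losses, group labels, and weights below are made up. In particular, the weights are hand-picked and do not come from Re-SAT's sample affinity test, which the abstract does not specify.

```python
def mean(values):
    return sum(values) / len(values)


def group_losses(losses, groups):
    """Average loss per group label."""
    by_group = {}
    for loss, group in zip(losses, groups):
        by_group.setdefault(group, []).append(loss)
    return {group: mean(vals) for group, vals in by_group.items()}


def reweighted_loss(losses, weights):
    """Weighted average loss; uniform weights recover the ERM objective."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Hypothetical per-utterance losses: four healthy, two dysarthric speakers.
losses = [0.2, 0.3, 0.1, 0.2, 1.5, 1.7]
groups = ["healthy"] * 4 + ["dysarthric"] * 2

erm_loss = mean(losses)                  # one number, dominated by the majority
per_group = group_losses(losses, groups) # reveals the gap between groups
weights = [1, 1, 1, 1, 2, 2]             # hypothetical upweighting
weighted = reweighted_loss(losses, weights)
```

The single ERM value sits between the two group averages and says nothing about the gap. Upweighting the minority samples moves the training objective toward the group that is served worst.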