arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.19635 2026-04-22 cs.SD cs.AI Updated version

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang, Zhiyong Wu

Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
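
To make the Real-Time-Factor figure in this abstract concrete, here is a minimal sketch of how an RTF is usually computed: processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. The 2-second chunk length is a hypothetical value chosen for illustration and does not come from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical chunk length (an assumption, not a value from the abstract).
chunk_seconds = 2.0
# Compute budget implied by the reported RTF of 0.248 for that chunk.
compute_seconds = 0.248 * chunk_seconds
rtf = real_time_factor(compute_seconds, chunk_seconds)
```

Under this assumption, each 2-second chunk would take roughly half a second of compute, leaving headroom before the next chunk arrives.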

2604.19477 2026-04-22 cs.SD cs.CL Updated version

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

Hyunjung Joo, GyeongTaek Lee

Abstract

The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this end, we introduce the first large-scale benchmark dataset, consisting of 10,093 manually annotated Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.
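
For readers unfamiliar with supervised contrastive learning, the sketch below implements the generic supervised contrastive (SupCon) loss in plain Python. It is not the Dual-Glob objective, whose details the abstract does not give; the temperature value and the toy 2-D embeddings are assumptions for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss, averaged over anchors with positives."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp over all other samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: two tight clusters (hypothetical data).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supcon_loss(toy, [0, 0, 1, 1])    # labels agree with the clusters
loss_mismatched = supcon_loss(toy, [0, 1, 0, 1])  # labels cut across the clusters
```

The loss is small when same-label samples sit close together in the latent space and large when they do not, which is the property a contrastive classifier of contour shapes relies on.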

2512.19442 2026-04-22 eess.SP cs.LG cs.SD Updated version

Real-Time Streamable Generative Speech Restoration with Flow Matching

Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann

Comments This work has been submitted to the IEEE for possible publication

Abstract

Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.

2509.04072 2026-04-22 eess.AS cs.CL cs.SD Updated version

Computational Narrative Understanding for Expressive Text-to-Speech

Gaspard Michel, Elena V. Epure, Christophe Cerisara

Comments Findings of ACL 2026

Abstract

Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building on this observation, we introduce LibriQuote, a large-scale dataset of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., "he whispered softly"). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances the expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.

2503.23439 2026-04-22 cs.CL cs.AI cs.LG cs.SD eess.AS Updated version

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

Hyunjong Ok, Suho Yoo, Jaeho Lee

Comments ACL 2026

Abstract

Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
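
The two-stage idea in this abstract can be sketched as a simple cascade: a cheap local check flags non-speaking frames, and the expensive server model is consulted only once a pause has lasted long enough. Everything below is an illustrative assumption: the function names, the energy threshold standing in for the GRU model, and the toy rule standing in for the Wav2vec-based classifier are all hypothetical.

```python
def speculative_etd(frames, is_silent, server_is_turn_end, min_silent_frames=3):
    """Return (frame index where the turn is declared over, or None; server calls)."""
    silent_run = 0
    server_calls = 0
    for idx, frame in enumerate(frames):
        # Stage 1: lightweight on-device check, run on every frame.
        silent_run = silent_run + 1 if is_silent(frame) else 0
        # Stage 2: query the server once per pause, when it reaches the threshold.
        if silent_run == min_silent_frames:
            server_calls += 1
            if server_is_turn_end(frames[: idx + 1]):
                return idx, server_calls
    return None, server_calls


# Toy input: per-frame energies with a short hesitation and then a final pause.
energies = [0.9, 0.8, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0]
result = speculative_etd(
    energies,
    is_silent=lambda e: e < 0.1,                     # stand-in for the local model
    server_is_turn_end=lambda ctx: len(ctx) >= 8,    # stand-in for the server model
)
```

In this toy run the server is called twice for ten frames: once for the hesitation, which it rejects, and once for the final pause, which it accepts.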

2406.14294 2026-04-22 cs.SD cs.AI eess.AS Updated version

DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi, Jarod Duret, Darius Petermann, Artem Ploujnikov, Luca Della Libera, Anastasia Kuznetsova, Cem Subakan, Mirco Ravanelli

Abstract

Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. The DASB code, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.

2604.19300 2026-04-22 cs.SD cs.AI Updated version

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

Feiyu Zhao, Yiming Chen, Wenhuan Lu, Daipeng Zhang, Xianghu Yue, Jianguo Wei

Comments Accepted to ACL 2026

Abstract

Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.

2604.19209 2026-04-22 cs.SD Updated version

Audio Spoof Detection with GaborNet

Waldek Maciejko

Comments Industrial conference materials

Abstract

One direction of development in the extraction of features from audio signals is based on processing raw samples in the time domain. Such an approach appears to be effective, especially in the era of neural networks. An example is SincNet. In this solution, the core of the neural network layer is a set of sinc functions that are convolved with the input signal. Due to the finite length of sinc functions, distortions appear in the frequency domain of the convolved signal, the same as in the case of windowing the signal. Recently, a new approach has been developed that uses Gabor filters to replace sinc functions. Because the filter outputs are complex-valued, further modifications had to be applied, such as squared modulus or Gaussian Lowpass Pooling. In this work, an ingestion layer based on a bank of Gabor filters, named GaborNet, and its modifications are intensively examined within the popular RawNet2 and RawGAT-ST architectures. These have been developed for the purpose of audio spoof detection. Another issue that has been investigated is audio augmentation using codec conversions, room responses, and additive noises.

2604.19055 2026-04-22 cs.SD Updated version

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

Aoduo Li, Haoran Lv, Shengmin Li, Sihao Qin, Hongjian Xu

Comments 10 pages, 6 figures. Accepted to ACM ICMR 2026

Abstract

High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge this gap, we present ATRIE, a unified framework utilizing a Persona-Prosody Dual-Track (P2-DT) architecture. Our system disentangles generation into a static Timbre Track (via Scalar Quantization) and a dynamic Prosody Track (via Hierarchical Flow-Matching), distilled from a 14B LLM teacher. This design enables robust identity preservation (Zero-Shot Speaker Verification EER: 0.04) and rich emotional expression. Evaluated on our extended AnimeTTS-Bench (50 characters), ATRIE achieves state-of-the-art performance in both generation and cross-modal retrieval (mAP: 0.75), establishing a new paradigm for persona-driven multimedia content creation.
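
The speaker-verification figure above is an Equal Error Rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. The sketch below shows one common way to estimate it from trial scores by sweeping a threshold. The score lists are made-up toy data, and the abstract does not say which EER procedure the authors used.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptances
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejections
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy similarity scores (hypothetical): one impostor trial overlaps the genuine range.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
toy_eer = equal_error_rate(genuine_scores, impostor_scores)
```

With these toy scores one of four impostor trials is accepted and one of four genuine trials is rejected at the crossover threshold, so the estimate is 0.25. Perfectly separated score sets give 0.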

2604.18932 2026-04-22 cs.SD cs.AI Updated version

Tadabur: A Large-Scale Quran Audio Dataset

Faisal Alherran

Comments Project page: https://fherran.github.io/tadabur/

Abstract

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

2604.18665 2026-04-22 cs.SD Updated version

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

Deshui Miao, Yameng Gu, Chao Yang, Xin Li, Haijun Zhang, Ming-Hsuan Yang

Abstract

This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a segmentation-oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio-to-text conversion, visual existence verification, coarse video segmentation, and agent-guided refinement. Such a staged design is substantially more appropriate for audio-conditioned Ref-VOS than directly sending noisy ASR outputs into a segmentation model.
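
The staged control flow described in this abstract, including the early exit with all-zero masks, can be sketched as follows. The four stage functions are hypothetical stand-ins passed in as callables; they do not reproduce the APIs of VibeVoice-ASR, the Omni-based judge, Sa2VA, or SAM3, and the toy frames and lambdas exist only to make the sketch runnable.

```python
def zero_mask(frame):
    """All-zero mask with the same height and width as a 2-D frame."""
    return [[0] * len(row) for row in frame]


def run_pipeline(audio, frames, transcribe, target_exists, coarse_segment, refine):
    """Audio-conditioned Ref-VOS as four stages with an early exit."""
    transcript = transcribe(audio)                   # stage 1: audio to text
    if not target_exists(transcript, frames):        # stage 2: existence check
        return [zero_mask(f) for f in frames]        # target absent: stop early
    coarse = coarse_segment(transcript, frames)      # stage 3: coarse trajectory
    return refine(transcript, frames, coarse)        # stage 4: refinement


# Toy stand-ins (all hypothetical): two 2x2 frames and trivial stage functions.
toy_frames = [[[5, 5], [5, 5]], [[5, 5], [5, 5]]]
stages = dict(
    transcribe=lambda audio: audio.lower(),
    target_exists=lambda text, frames: "dog" in text,
    coarse_segment=lambda text, frames: [[[1] * len(r) for r in f] for f in frames],
    refine=lambda text, frames, masks: masks,
)
masks_absent = run_pipeline("The CAT on the left", toy_frames, **stages)
masks_present = run_pipeline("The DOG on the left", toy_frames, **stages)
```

Putting the existence check before segmentation means a query about an absent object never reaches the segmentation stage at all.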

2604.18636 2026-04-22 cs.SD cs.LG Updated version

Virtual boundary integral neural network for three-dimensional exterior acoustic problems

Jiahao Li, Qiang Xi, Ilia Marchevskiy, Zhuojia Fu

Abstract

This paper presents a virtual boundary integral neural network (VBINN) for exterior acoustic problems in three dimensions. The method introduces a virtual boundary inside the scatterer or vibrating body and represents the associated source density with a neural network. Coupled with the acoustic fundamental solution, this representation satisfies the Sommerfeld radiation condition by construction and enables direct evaluation of the acoustic pressure and its normal derivative at arbitrary field points. Because the integration surface is separated from the physical boundary, the formulation avoids the singular and near-singular kernel evaluations associated with coincident source and collocation points in conventional boundary integral learning methods. To reduce sensitivity to boundary placement, the geometric parameters of the virtual boundary are optimized jointly with the source density during training. Numerical examples for acoustic scattering, multiple-body interaction, and underwater acoustic propagation show close agreement with analytical solutions and COMSOL results, and the Burton-Miller extension further improves stability near characteristic frequencies. These results demonstrate the potential of VBINN for exterior acoustic analysis in three dimensions.
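
As a rough illustration of the representation described in this abstract, the sketch below evaluates an acoustic pressure as a weighted sum of source densities on virtual-boundary points, each multiplied by the 3-D Helmholtz fundamental solution. The sign convention e^{+ikr}, the fixed quadrature weights, and the use of plain numbers for the density are all assumptions. In the paper the density is a neural network, and the abstract does not state its discretization.

```python
import cmath
import math


def green_3d(x, y, k):
    """Free-space Helmholtz fundamental solution exp(ikr) / (4*pi*r) in 3-D."""
    r = math.dist(x, y)
    return cmath.exp(1j * k * r) / (4 * math.pi * r)


def pressure(field_point, virtual_points, densities, weights, k):
    """Discretized single-layer sum over points on the virtual boundary."""
    return sum(
        w * sigma * green_3d(field_point, y, k)
        for y, sigma, w in zip(virtual_points, densities, weights)
    )


# One unit source at the origin, observed 2 units away at wavenumber k = 3.
p = pressure((2.0, 0.0, 0.0), [(0.0, 0.0, 0.0)], [1.0], [1.0], k=3.0)
```

Each term is an outgoing spherical wave that satisfies the radiation condition, so any finite sum of them does too. Because the field points lie outside the body while the sources lie inside it, r never reaches zero and the kernel stays non-singular.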

2604.18631 2026-04-22 cs.SD Updated version

Towards Revised Tempo Indications for Beethoven's Piano and Cello Sonatas: Czerny, Moscheles, Kolisch, and Recorded Practice 1930-2012

Ignasi Sole

Abstract

Historical metronome indications for Beethoven's five piano and cello sonatas (as transmitted by Czerny, Moscheles, and Kolisch) have long been regarded as problematic by performers and scholars alike. This paper presents the first systematic empirical assessment of those indications against a corpus of over one hundred movement-level recordings spanning 1930-2012, encompassing first, second, and third movements across all five sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2). The core findings are threefold. First, Czerny's and Moscheles's markings are consistently and substantially exceeded by the entire recording corpus: gaps of 15-39% are documented across movements, with the largest divergences in slow Adagio movements and the smallest in fast Allegro finales. Second, Kolisch's 1943 markings align considerably more closely with recorded practice than either Czerny's or Moscheles's, a striking result given that Kolisch was reasoning without corpus data. Third, the central Allegro tempo traditions for each movement are stable across eight decades, not because all performers play alike, but because three coexisting slow, mid-range, and fast traditions persist simultaneously, with the mid-range dominant throughout. Building on these findings, this paper proposes a set of revised tempo indications grounded in the statistical modal tempi of the corpus, presented as ranges reflecting the documented spectrum of expert interpretive practice rather than single prescriptive values. These indications are offered not as claims about Beethoven's intentions but as evidence-based reference points for performers and scholars navigating the gap between historical prescription and performable reality.

2604.18630 2026-04-22 cs.SD Updated version

A Complementary Visualisation Suite for Empirical Performance Analysis: Tempographs, Histograms, Ridgeline Plots, Stacked Bar Charts, and Combination Charts Applied to Beethoven's Piano and Cello Sonatas

Ignasi Sole

Abstract

The choice of visualisation in empirical performance analysis is not a neutral presentation decision but an analytical one: different graphical forms reveal different features of the same dataset, and reliance on any single type systematically conceals what the others expose. This paper presents and argues for a suite of five complementary visualisation tools: tempographs, histograms with spline-smoothed probability density functions, ridgeline plots, stacked bar charts, and combination charts. These are applied to bar-level beats-per-minute data from recordings of Beethoven's five piano and cello sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2) spanning 1930-2012. Each tool is described formally, its analytical properties characterised, its implementation detailed in working Python and MATLAB code, and its specific contribution demonstrated on a worked example using two recordings of Op. 5 No. 1 (Casals/Horszowski 1930-39 and Isserlis/Levin 2012) separated by eight decades. A five-panel composite figure applies all five tools to the same two recordings simultaneously, making the complementarity argument concrete: the tempograph reveals moment-to-moment structural parallels invisible in aggregate statistics; the spline-smoothed histogram exposes bimodality and secondary peaks suppressed by binning artefacts; the ridgeline plot positions both recordings within the full distributional space; the stacked bar chart shows divergent sectional pacing concealed by identical movement means; and the combination chart integrates mean tempo, variability, and historical reference marks in a single view. The spline-CDF smoothing method, applied to histogram data via cubic spline interpolation with zero-slope boundary conditions, is presented as a novel contribution to the performance analysis toolkit. Full implementation code is publicly available.

2603.14432 2026-04-22 cs.SD Updated version

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Comments Accepted to Findings of ACL 2026

Abstract

Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.

2512.06380 2026-04-22 cs.SD cs.AI Updated version

Protecting Bystander Privacy via Selective Hearing in Audio LLMs

Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland

Comments To Appear at ACL 2026 main conference; Dataset: https://huggingface.co/datasets/BrianatCambridge/SelectiveHearingBench

Abstract

Audio large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.