arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.22290 2026-04-27 cs.SD cs.MM eess.AS Version update

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

Maximilian Wachter, Sebastian Murgul, Michael Heizmann

Comments Accepted to the 5th International Conference on SMART MULTIMEDIA (ICSM), 2025

Abstract

Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
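The beat-based pre-quantization mentioned in this abstract can be illustrated with a minimal sketch. It assumes that pre-quantization means mapping onset times in seconds onto fractional beat positions by linear interpolation between annotated beats, then snapping to a fixed subdivision. The function names `seconds_to_beats` and `snap_to_grid` and the sample data are hypothetical; the authors' actual procedure is not specified here and may differ.

```python
from bisect import bisect_right

def seconds_to_beats(onsets_sec, beat_times_sec):
    """Map onset times (seconds) to fractional beat positions by
    linear interpolation between neighbouring beat annotations."""
    if len(beat_times_sec) < 2:
        raise ValueError("need at least two beat annotations")
    positions = []
    for t in onsets_sec:
        # Index of the beat interval containing t, clamped so that onsets
        # before the first or after the last beat are extrapolated.
        i = bisect_right(beat_times_sec, t) - 1
        i = max(0, min(i, len(beat_times_sec) - 2))
        start, end = beat_times_sec[i], beat_times_sec[i + 1]
        positions.append(i + (t - start) / (end - start))
    return positions

def snap_to_grid(beat_positions, subdivisions=4):
    """Round beat positions to the nearest 1/subdivisions of a beat."""
    return [round(p * subdivisions) / subdivisions for p in beat_positions]

# Hypothetical data: beat annotations with tempo drift, and performed onsets.
beats = [0.0, 0.5, 1.1, 1.6]
onsets = [0.02, 0.26, 0.80, 1.58]
quantized = snap_to_grid(seconds_to_beats(onsets, beats))
print(quantized)
```

Because positions are expressed in beats rather than seconds, the same grid applies regardless of local tempo changes in the performance.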

2604.22276 2026-04-27 eess.AS cs.SD Version update

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

Youichi Okita, Haruhiro Katayose

Comments Accepted for ICASSP2026

Journal ref Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 15952-15956, 2026

Abstract

Audio effects play an essential role in sound design. This research addresses the task of audio effect estimation, which aims to estimate the configuration of applied effects from a wet signal. Existing approaches to this problem can be categorized into predictive approaches, which use models pre-trained in a data-driven manner, and search-based approaches, which are based on wet signal reconstruction. In this study, we propose a novel approach that integrates these approaches: first, DNNs predict the dry signal and effect configuration, and then a search is performed based on wet signal reconstruction using these predictions. By estimating the dry signal in the prediction stage, it becomes possible to complement or improve the predictions using reconstruction similarity as an objective function. The experimental evaluation showed that methods based on the proposed approach outperformed the method solely based on the predictive approach. Furthermore, the findings suggest that the task division of predicting the effect type combination followed by the search-based estimation of order and parameters was the most effective across various metrics.
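The predict-then-search idea described in this abstract can be sketched with a toy example. The sketch assumes a made-up effect chain (gain followed by hard clipping) and replaces the DNN outputs with hard-coded stand-ins for the dry-signal estimate and the predicted parameters. All names (`apply_effect`, `refine_by_search`) are hypothetical and do not come from the paper.

```python
def apply_effect(dry, gain, threshold):
    """Toy effect chain: gain followed by hard clipping at +/- threshold."""
    return [max(-threshold, min(threshold, gain * x)) for x in dry]

def reconstruction_error(wet, dry, gain, threshold):
    """Mean squared error between the observed and re-rendered wet signal."""
    rendered = apply_effect(dry, gain, threshold)
    return sum((w - r) ** 2 for w, r in zip(wet, rendered)) / len(wet)

def refine_by_search(wet, dry_estimate, predicted, step=0.05, radius=10):
    """Grid search around the predicted (gain, threshold) pair, keeping the
    candidate whose re-rendered signal is closest to the wet signal."""
    best = predicted
    best_err = reconstruction_error(wet, dry_estimate, *predicted)
    for i in range(-radius, radius + 1):
        for j in range(-radius, radius + 1):
            gain = predicted[0] + i * step
            threshold = predicted[1] + j * step
            if gain <= 0 or threshold <= 0:
                continue  # skip physically meaningless settings
            err = reconstruction_error(wet, dry_estimate, gain, threshold)
            if err < best_err:
                best, best_err = (gain, threshold), err
    return best, best_err

# Stand-ins for the prediction stage: a perfect dry estimate and a
# slightly wrong parameter guess (both are assumptions for illustration).
dry = [0.0, 0.2, 0.5, -0.4, 0.9, -0.8, 0.1]
wet = apply_effect(dry, gain=2.0, threshold=0.8)
params, err = refine_by_search(wet, dry, predicted=(1.8, 0.7))
print(params, err)
```

The search only works because a dry-signal estimate is available to re-render, which is the role the abstract assigns to the prediction stage.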

2604.22209 2026-04-27 eess.AS cs.AI cs.CL cs.SD Version update

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang

Comments Accepted to ACL 2026 main conference (oral)

Abstract

Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.

2604.22203 2026-04-27 eess.AS cs.SD Version update

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

Szu-Jui Chen, John H. L. Hansen

Comments Accepted to Speech Communication 2026

Journal ref Speech Communication 180 (2026) 103380

Abstract

Using self-supervised learning (SSL) models has significantly improved performance for downstream speech tasks, surpassing the capabilities of traditional hand-crafted features. This study investigates the amalgamation of SSL models, with the aim to leverage both their individual strengths and refine extracted features to achieve improved speech recognition models for naturalistic scenarios. Our research investigates the massive naturalistic Fearless Steps (FS) APOLLO resource, with particular focus on the FS Challenge (FSC) Phase-4 corpus, providing the inaugural analysis of this dataset. Additionally, we incorporate the CHiME-6 dataset to evaluate performance across diverse naturalistic speech scenarios. While exploring previously proposed Feature Refinement Loss and fusion methods, we found these methods to be less effective on the FSC Phase-4 corpus. To address this, we introduce a novel deep cross-attention (DCA) fusion method, designed to elevate performance, especially for the FSC Phase-4 corpus. Our objective is to foster creation of superior FS APOLLO community resources, catering to the diverse needs of researchers across various disciplines. The proposed solution achieves an absolute +1.1% improvement in WER, providing effective meta-data creation for the massive FS APOLLO community resource.

2604.22133 2026-04-27 eess.AS cs.SD Version update

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

Haopeng Geng, Longfei Yang, Xi Chen, Haitong Sun, Daisuke Saito, Nobuaki Minematsu

Abstract

Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework decoupling acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the IF strategy under the knowledge transfer principle. Experiments show CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and 71.70% F1-score on the Iqra'Eval2 leaderboard. With empirical analysis, we demonstrate that decoupling acoustics from explicit priors provides highly robust MDD.

2604.22037 2026-04-27 cs.SD eess.AS Version update

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven's Piano and Cello Sonatas, 1930–2012

Ignasi Sole

Abstract

Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualiser's melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram's known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. Applied to the opening measures of. Specifically because their monophonic texture permits reliable spectrographic pitch tracking. The method yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930–2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: that slower performances produce steeper, longer slides while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuou
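The gradient descriptor in this abstract is a slope in Hz/s, which can be sketched from two pixel coordinates on a spectrogram image. The sketch assumes a linear frequency axis calibrated by two reference points and a constant number of seconds per horizontal pixel; a logarithmic axis would need a different conversion. The function names and all numeric values are hypothetical and are not taken from the paper's protocol.

```python
def pixel_to_hz(y_px, y_ref_low_px, hz_low, y_ref_high_px, hz_high):
    """Convert a vertical pixel coordinate to Hz, assuming a linear
    frequency axis calibrated by two reference points."""
    hz_per_px = (hz_high - hz_low) / (y_ref_high_px - y_ref_low_px)
    return hz_low + (y_px - y_ref_low_px) * hz_per_px

def portamento_gradient(start, end, seconds_per_px, calibration):
    """Gradient of a slide in Hz/s from its start and end pixels (x, y)."""
    f_start = pixel_to_hz(start[1], *calibration)
    f_end = pixel_to_hz(end[1], *calibration)
    duration = (end[0] - start[0]) * seconds_per_px
    if duration <= 0:
        raise ValueError("slide must have positive duration")
    return (f_end - f_start) / duration

# Hypothetical calibration: row 500 is 200 Hz and row 100 is 1000 Hz
# (image rows grow downward, so higher pitch means a smaller y value).
calibration = (500, 200.0, 100, 1000.0)
gradient = portamento_gradient(
    start=(120, 400), end=(170, 300),
    seconds_per_px=0.002, calibration=calibration,
)
print(gradient)
```

Two slides of equal duration can have very different gradients if they span different pitch intervals, which is the distinction the abstract draws between gradient and duration.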

2604.11594 2026-04-27 eess.AS cs.SD Version update

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

Shuiyuan Wang, Zhixian Zhao, Hongfei Xue, Chengyou Wang, Shuai Wang, Hui Bu, Xin Xu, Lei Xie

Abstract

Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.

2604.11103 2026-04-27 cs.SD cs.AI Version update

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

Xi Chen, Wei Xue, Yike Guo

Abstract

Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.

2505.14351 2026-04-27 cs.SD cs.AI cs.CL eess.AS Version update

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

Comments This paper has been substantially restructured using a revised writing style. In addition, considering that maintaining two preprints simultaneously may not fully align with academic publishing ethics, we have withdrawn the previous version. Please refer to the updated manuscript at: arXiv:509.18060

Abstract

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), which limits progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.

2501.07557 2026-04-27 cs.SD cs.CY eess.AS physics.soc-ph Version update

Decoding Musical Evolution Through Network Science

Niccolo' Di Marco, Edoardo Loru, Alessandro Galeazzi, Matteo Cinelli, Walter Quattrociocchi

Abstract

Music has always been central to human culture, reflecting and shaping traditions, emotions, and societal changes. Technological advancements have transformed how music is created and consumed, influencing tastes and the music itself. In this study, we use Network Science to analyze musical complexity. Drawing on approximately 20,000 MIDI files across six macro-genres spanning nearly four centuries, we represent each composition as a weighted directed network to study its structural properties. Our results show that Classical and Jazz compositions have higher complexity and melodic diversity than recently developed genres. However, a temporal analysis reveals a trend toward simplification, with even Classical and Jazz nearing the complexity levels of modern genres. This study highlights how digital tools and streaming platforms shape musical evolution, fostering new genres while driving homogenization and simplicity.
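Representing a composition as a weighted directed network can be sketched as follows. The sketch assumes that nodes are MIDI pitch numbers and that an edge weight counts how often one pitch is immediately followed by another. The abstract does not state the authors' node definition or complexity measures, so `transition_network`, `out_entropy`, and the example melody are hypothetical illustrations only.

```python
from collections import Counter
from math import log2

def transition_network(pitches):
    """Weighted directed edges: (from_pitch, to_pitch) -> count."""
    return Counter(zip(pitches, pitches[1:]))

def out_entropy(edges, node):
    """Shannon entropy (bits) of the outgoing edge weights of a node."""
    weights = [w for (src, _), w in edges.items() if src == node]
    total = sum(weights)
    if total == 0:
        return 0.0  # node has no outgoing edges
    return -sum(w / total * log2(w / total) for w in weights)

# Hypothetical melody as MIDI pitch numbers (60 = middle C).
melody = [60, 62, 64, 60, 62, 60, 64]
edges = transition_network(melody)
for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight}")
print(out_entropy(edges, 60))
```

Higher outgoing entropy means the next note is less predictable from the current one, which is one simple proxy for the kind of melodic diversity the abstract discusses.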

2411.03715 2026-04-27 cs.SD eess.AS Version update

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Wen-Chin Huang, Erica Cooper, Tomoki Toda

Comments Accepted to Transactions on Audio, Speech and Language Processing

Abstract

In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.