arXivDaily arXiv每日学术速递 周一至周五更新
2605.06627 2026-05-08 cs.SD cs.LG 版本更新

PianoCoRe: Combined and Refined Piano MIDI Dataset

Ilya Borovik

Comments Published in TISMIR. Project repository: https://github.com/ilya16/PianoCoRe

Journal ref Transactions of the International Society for Music Information Retrieval, 9(1), 144-163, 2026

详情
英文摘要

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.

2605.06035 2026-05-08 cs.SD cs.AI 版本更新

Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

Lisan Al Amin, Rakib Hossain, Mahbubul Islam, Faisal Quader, Thanh Thi Nguyen

详情
英文摘要

Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.

2605.05982 2026-05-08 cs.SD 版本更新

Do Melody and Rhythm Coevolve?

Harin Lee, Rainer Polak, Manuel Anglada-Tort, Marc Schönwiesner, Minsu Park, Nori Jacoby

Comments 6 pages, 3 figures, to be included in Proceedings of the Annual Meeting of the Cognitive Science Society

详情
英文摘要

Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities between countries aligned with geographic and linguistic relationships, validating our approach. Substantial variation emerged in both melodic and rhythmic structures across countries, yet the diversity of the two components was not significantly correlated, challenging assumptions of coupled evolution. Only rhythmic diversity was significantly associated with ethnic and linguistic heterogeneity, while melodic diversity showed no such association. These findings suggest that melody and rhythm constitute partially independent systems shaped by distinct cultural and evolutionary pressures, rather than components of a single monolithic musical style.

2602.12805 2026-05-08 physics.med-ph cs.SD eess.IV 版本更新

A Wavefield Correlation Approach to Improve Sound Speed Estimation in Ultrasound Autofocusing

Louise Zhuang, Samuel Beuret, Ben Frey, Saachi Munot, Walter Simson, Dongwoon Hyun, Jeremy J. Dahl

详情
英文摘要

In pulse-echo ultrasound, aberration often degrades image quality when beamforming does not account for wavefront distortions. To address this issue, local sound speed estimators have been developed in the past decade for distributed aberration correction. Recently, methods based on iterative optimization have improved sound speed accuracy with respect to earlier approaches. However, the accuracy of these newer methods is limited by media with reverberation clutter and by the straight-ray model of wave propagation. To address these challenges, we propose using wavefield correlation (WFC) beamforming when performing sound speed optimization. WFC, an ultrasound adaptation of reverse time migration, correlates simulated forward-propagated transmit wavefields and backwards-propagated receive wavefields in order to reconstruct images. This process more accurately models wave propagation in heterogeneous media and can decrease diffuse clutter due to its spatiotemporal matched filtering effect. We implement herein a WFC beamformer using an auto-differentiation software and estimate the sound speed map by optimizing a regularized common-midpoint phase focusing criterion using gradient descent. This approach is compared to a previous method relying on delay and sum (DAS) with straight-ray time delay calculations on a variety of simulated, phantom, and in vivo data with large sound speed variations and clutter. Results show that using WFC decreases sound speed estimation error, leading to improvements in resolution and contrast in the corrected image. In particular, these promising results have potential to improve pulse-echo imaging for challenging clinical scenarios.

2601.20362 2026-05-08 cs.SD cs.AI 版本更新

Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding

Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen

Comments This manuscript contains critical errors in the experimental parameter settings and partial algorithm derivation in Section 3 and Section 4, which will lead to inaccurate conclusion interpretation. We need to withdraw the paper for comprehensive revision, re-calculation and experimental verification, and will resubmit after full correction

详情
英文摘要

Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content-especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.

2505.24437 2026-05-08 cs.SD eess.AS 版本更新

SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization

Jin Wang, Wenbin Jiang, Xiangbo Wang, Yubo You, Sheng Fang

Comments There is some technical error in this paper's method

详情
英文摘要

Neural audio compression has emerged as a promising technology for efficiently representing speech, music, and general audio. However, existing methods suffer from significant performance degradation at limited bitrates, where the available embedding space is sharply constrained. To address this, we propose a universal high-fidelity neural audio compression algorithm featuring Residual Experts Vector Quantization (REVQ), which substantially expands the embedding space with minimal impact on bandwidth. A gentle load-balancing strategy is introduced to ensure the full utilization of this expanded space. Furthermore, we develop a novel multi-tiered discriminator that periodically stratifies STFT spectra, guiding the generator to focus on critical spectral regions. To support multiple bitrates without quality loss at the lower end, we adopt an efficient post-training strategy. Our proposed model achieves impressive performance, with PESQ and ViSQOL scores of 2.87 and 4.27, respectively, at 2.67 kbps bandwidth. The approach effectively reduces spectral blur, decreasing the distance to the original mel-spectrogram by 13%. Notably, our post-training strategy achieves performance comparable to dedicated fixed-bitrate models while reducing the required training time by half. Extensive ablation studies confirm the superiority of our method over baselines.

2505.01747 2026-05-08 eess.AS cs.SD 版本更新

Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Gerhard Widmer

Comments Task Description Page: https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information

详情
英文摘要

This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge, along with its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022-2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics-reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy with a device-agnostic model, improving to 51.89% when incorporating device-specific fine-tuning. The task attracted 31 submissions from 12 teams, with 11 teams outperforming the baseline. The top-performing submission achieved an accuracy gain of more than 8 percentage points over the baseline on the evaluation set.

2605.05554 2026-05-08 eess.AS cs.SD 版本更新

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

Wonwoo Jeong

Comments 21 pages, 4 figures, 10 tables. The otadtk toolkit is available at https://github.com/wonwoo-jeong/otadtk

详情
英文摘要

In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $ε= 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.

2605.05231 2026-05-08 eess.AS cs.SD 版本更新

Prompting Whisper for Joint Speech Transcription and Diarization

Mariia Zamyrova, Henk van den Heuvel

Comments To be presented at the Joint Workshop on HSCMA and CHiME 2026

详情
英文摘要

As part of the MediSpeech project, we aim to develop a system that transcribes and diarizes Dutch conversations between doctors and patients in real-time. In this research (in-progress) we explore ways of efficiently combining Whisper with speaker diarization (SD). After trying to prompt Whisper with text that contains speaker labels, we observed that it is able to insert labels into the transcription with promising accuracy. We continued this line of research by fine-tuning Whisper with speaker-labelled prompts to generate transcriptions in a format similar to that of Serialized Output Training (SOT). Fine-tuning Whisper yielded more consistent speaker IDs across the chunks of long-form audio and improved verbatim transcription. The study uncovered new challenges as Whisper's SD performance suffers because of mistakes that get propagated through prompts and inaccurate timestamps assigned to overlapping speech.

2604.20229 2026-05-08 cs.SD cs.AI 版本更新

Enhancing Speaker Verification with Whispered Speech via Post-Processing

Magdalena Gołębiowska, Piotr Syga

Comments 15 pages, 3 figures, conference paper at ACIIDS 2026

详情
英文摘要

Speaker verification is a task of confirming an individual's identity through the analysis of their voice. Whispered speech differs from phonated speech in acoustic characteristics, which degrades the performance of speaker verification systems in real-life scenarios, including avoiding fully phonated speech to protect privacy, disrupt others, or when the lack of full vocalization is dictated by a disease. In this paper we propose a model with a training recipe to obtain more robust representations against whispered speech hindrances. The proposed system employs an encoder--decoder structure built atop a fine-tuned speaker verification backbone, optimized jointly using cosine similarity--based classification and triplet loss. We gain relative improvement of 22.26\% compared to the baseline (baseline 6.77\% vs ours 5.27\%) in normal vs whispered speech trials, achieving AUC of 98.16\%. In tests comparing whispered to whispered, our model attains an EER of 1.88\% with AUC equal to 99.73\%, which represents a 15\% relative enhancement over the prior leading ReDimNet-B2. We also offer a summary of the most popular and state-of-the-art speaker verification models in terms of their performance with whispered speech. Additionally, we evaluate how these models perform under noisy audios, obtaining that generally the same relative level of noise degrades the performance of speaker verification more significantly on whispered speech than on normal speech.

2602.22710 2026-05-08 cs.SD cs.AI cs.HC 版本更新

Same Words, Different Judgments: How Preferences Vary Across Modalities

Aaron Broukhim, Nadir Weibel, Eshin Jolly

Comments Submitted to NeurIPS 2026 for review

详情
英文摘要

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving $\textit{good}$ agreement within either modality (ICC(2,$k$) $\approx$ .80) requires $\sim$9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving as an early signal for stimulus selection and proxy for human annotations. Together, these findings argue that evaluation protocols for audio preference data require modality-specific design rather than direct adaptation from text.

2601.21264 2026-05-08 cs.HC cs.SD eess.AS 版本更新

Evaluating Spatialized Auditory Cues for Rapid Attention Capture in XR

Yoonsang Kim, Swapnil Dey, Arie Kaufman

Comments 8 pages, 4 figures. This is the author's version of the article that appeared at the IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (IEEE VRW) 2026

详情
英文摘要

In time-critical eXtended reality (XR) scenarios where users must rapidly reorient their attention to hazards, alerts, or instructions while engaged in a primary task, spatial audio can provide an immediate directional cue without occupying visual bandwidth. However, such scenarios can afford only a brief auditory exposure, requiring users to interpret sound direction quickly and without extended listening or head-driven refinement. This paper reports a controlled exploratory study of rapid spatial-audio localization in XR. Using HRTF-rendered broadband stimuli presented from a semi-dense set of directions around the listener, we quantify how accurately users can infer coarse direction from brief audio alone. We further examine the effects of short-term visuo-auditory feedback training as a lightweight calibration mechanism. Our findings show that brief spatial cues can convey coarse directional information, and that even short calibration can improve users' perception of aural signals. While these results highlight the potential of spatial audio for rapid attention guidance, they also show that auditory cues alone may not provide sufficient precision for complex or high-stakes tasks, and that spatial audio may be most effective when complemented by other sensory modalities or visual cues, without relying on head-driven refinement. We leverage this study on spatial audio as a preliminary investigation into a first-stage attention-guidance channel for wearable XR (e.g., VR head-mounted displays and AR smart glasses), and provide design insights on stimulus selection and calibration for time-critical use.