arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.15044 2026-05-15 cs.SD cs.AI cs.LG cs.MM eess.AS Updated version

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

KiHyun Nam, Jungwoo Heo, Siu Bae, Ha-Jin Yu, Joon Son Chung

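Affiliations Korea Advanced Institute of Science and Technology (KAIST); University of Seoul

AI summary With the rise of physical AI, conversational robots, and screenless wearables, audio large language models need speaker-specific understanding to support user authentication, personalization, and context-aware interaction. To this end, the paper proposes SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-based verification reasoning. At its core is a hierarchical speaker tokenizer that captures multi-granularity information about speaker identity and recording conditions, and structured reasoning traces improve the accuracy and interpretability of verification reasoning.

English abstract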

As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

2508.11845 2026-05-15 cs.SD cs.AI cs.IR cs.LG Updated version

AVEX: What Matters for Animal Vocalization Encoding

Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist

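Affiliations Earth Species Project

AI summary This paper studies the key factors that affect model performance in animal vocalization encoding, with the goal of building a general-purpose bioacoustic encoder for diverse downstream tasks. Through large-scale experiments, the authors analyze how training data diversity, model architecture, and training strategy affect encoder performance, and they propose a hybrid recipe that combines self-supervised pre-training with supervised post-training, which significantly improves results across tasks and datasets. The study also finds that data diversity is critical in both training stages, and the authors plan to release model checkpoints to support follow-up research and applications.

Comments In The Fourteenth International Conference on Learning Representations 2026

English abstract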

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

2605.14896 2026-05-15 cs.SD cs.LG Updated version

Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

Amir Mohammad Rostami, Pourya Jafarzadeh

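Affiliations Self-Organized and Independent Participants

AI summary This paper describes Team Naive's system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system adapts the existing state-of-the-art neural networks ResNet-TDNN and NeXt-TDNN and adds a lightweight, efficient model, EfficientNet-A0. Combined with data augmentation and optimized hyperparameters, it achieves a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. The work demonstrates the effectiveness of multi-model ensemble learning for speaker and phrase verification.

English abstract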

This paper presents a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. Our approach focused on adapting existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset. This strategy was chosen because of the limited challenge duration and the available resources at the time. In addition, we designed a lightweight and resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble approach. Our system combines advanced neural architectures, extensive data augmentation, and optimised hyperparameters. These components helped achieve strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.
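For readers less familiar with the EER figure quoted in this abstract, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. The function name, the simple threshold sweep, and the toy scores are illustrative assumptions; the challenge's official scoring tool and the MinDCF cost parameters are not described in the abstract.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from genuine (target) and impostor (non-target) scores.

    Hypothetical helper for illustration: sweeps every observed score as a
    threshold and returns the mean of FAR and FRR where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for this example only.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(toy_targets, toy_nontargets))
```

On real evaluation sets the sweep is usually done on sorted scores with interpolation between operating points; the brute-force loop here is only meant to make the definition concrete.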

2605.14888 2026-05-15 cs.SD cs.LG Updated version

PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

Madhurananda Pahar, Caitlin H. Illingworth, Bahman Mirheidari, Hend Elghazaly, Fritz Peters, Sophie Young, Wing-Zin Leung, Labhpreet Kaur, Daniel Blackburn, Heidi Christensen

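Affiliations School of Computer Science, University of Sheffield; Sheffield Institute for Translational Neuroscience (SITraN), University of Sheffield

AI summary PROCESS-2 is a large-scale speech dataset for early cognitive impairment detection, intended to support research on automatic cognitive assessment from spontaneous and task-oriented speech. It contains recordings from 200 healthy controls, 150 participants with mild cognitive impairment, and 50 participants with dementia, about 21 hours in total, covering picture description and verbal fluency tasks, with manually verified transcripts and metadata. With rigorous validation and predefined train/test partitions, PROCESS-2 offers a reliable, reproducible benchmark resource for the field.

English abstract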

Speech-based analysis offers a scalable and non-invasive approach for detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 mild cognitive impairment, and 50 dementia diagnoses collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session, including picture description and verbal fluency tasks, accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical validation evaluated demographic balance, clinical consistency, recording stability, embedding-space structure, and reproducible baseline modelling performance, demonstrating clinically meaningful group separation and stable performance across modelling approaches while preserving real-world conversational variability. PROCESS-2 is released under controlled access via Hugging Face to enable responsible reuse while protecting participant privacy, providing a reproducible benchmark resource for speech-based cognitive assessment research.

2605.14765 2026-05-15 cs.SD cs.CL Updated version

Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

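Affiliations Sharif University of Technology; Independent Researcher

AI summary To address the lack of generative models for Persian music, this study builds a large-scale dataset of Persian music with more than 900 hours of high-quality audio spanning pop, traditional, and contemporary styles. The state-of-the-art generative model MusicGen is fine-tuned on this dataset so that it better matches the modal, rhythmic, and cultural characteristics of Persian music, and its performance is evaluated with subjective and objective metrics. The work provides a new resource for Persian music generation research and shows the potential of music generation models to adapt to underrepresented cultural contexts.

Comments 9 pages, 2 figures, 3 tables

English abstract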

Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.

2605.14731 2026-05-15 cs.GR cs.CV cs.SD Updated version

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Hansung Kim, Yuanqi Li, Jie Guo, Yanwen Guo

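Affiliations Nanjing University; Mogo AI Ltd.; University of Southampton

AI summary This paper proposes UMo, a unified sparse motion modeling approach for high-fidelity, real-time co-speech avatar animation. UMo processes text, audio, and motion within a single formulation and combines a spatially sparse Mixture-of-Experts framework with a temporally sparse keyframe design to perform efficient real-time dense reconstruction while maintaining temporal coherence and high fidelity. In addition, a multi-stage training strategy with targeted audio augmentation improves speech-motion alignment and semantic consistency, offering a practical solution for real-time co-speech animation.

English abstract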

Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

2605.14555 2026-05-15 cs.SD cs.AI Updated version

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

Shuyang Cui, Zhi Zhong, Qiyu Wu, Zachary Novack, Woosung Choi, Keisuke Toyama, Kin Wai Cheuk, Junghyun Koo, Yukara Ikemiya, Christian Simon, Chihiro Nagashima, Shusuke Takahashi

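Affiliations Sony Group Corporation; Sony AI

AI summary This paper proposes Break-the-Beat!, a controllable MIDI-to-drum audio synthesis model that addresses the lack of fine-grained control when generating drum loop audio in digital music production. The model fine-tunes a pre-trained text-to-audio model with a content encoder and a hybrid conditioning mechanism so that it renders drum MIDI with the timbre of a reference audio clip. Experiments show strong audio quality, rhythmic alignment, and beat continuity, giving music producers an efficient, controllable creative tool.

English abstract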

Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/

2605.14500 2026-05-15 cs.SD cs.HC eess.IV Updated version

Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross, Shervin Dehghani, Michael Sommersperger, Koorosh Faridpooya, Mohammad Ali Nasseri, Merle Fairhurst, Nassir Navab, Sasan Matinfar

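Affiliations Computer Aided Medical Procedures; TUM Klinikum Rechts der Isar; Rotterdam Eye Hospital; Centre for Tactile Internet with Human-in-the-Loop; Munich Center for Machine Learning; Chair for Social Affective Touch

AI summary This paper proposes a physics-based, real-time iOCT sonification framework to improve interaction awareness during subretinal injection. The method maps retinal layer information obtained from iOCT to auditory feedback, letting surgeons perceive needle position and retinal deformation by ear, which reduces visual load. In a user study, the method achieved high retinal layer identification accuracy and significantly outperformed a state-of-the-art baseline in overall event identification, mainly through better detection of retinal deformation, indicating potential for clinical use.

English abstract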

Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.

2605.14427 2026-05-15 cs.CL cs.SD Updated version

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Sunil Kumar Kopparapu

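Affiliations TCS Research

AI summary This paper proposes a calculus-based framework for determining the vocabulary size in end-to-end automatic speech recognition (ASR) systems. The method fits a curve to the training data and applies the first and second derivative tests to formally estimate this key hyperparameter. Experiments on the standard Librispeech corpus show that the approach works and that a well-chosen vocabulary size improves ASR performance. The main contribution is a systematic method for choosing the vocabulary size of an end-to-end ASR system.

Comments 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024

English abstract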

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper is in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
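The general idea of combining curve fitting with the first and second derivative tests can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which curve family the author fits or which cost is measured, so the quadratic model, the function names, and the synthetic data points below are all hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    n = 3
    for i in range(n):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [vr - f * vi for vr, vi in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(n))


def stationary_minimum(a, b):
    """First derivative test: 2*a*x + b = 0. Second derivative test: 2*a > 0."""
    if 2 * a <= 0:
        return None  # the fitted curve has no interior minimum
    return -b / (2 * a)


# Synthetic (vocabulary size in thousands, cost) pairs, invented for this sketch.
sizes = [0, 1, 2, 3, 4, 5, 6]
costs = [2 * (x - 3) ** 2 + 5 for x in sizes]
a, b, c = fit_quadratic(sizes, costs)
print(stationary_minimum(a, b))
```

The second derivative check matters because a stationary point of the fitted curve is only a candidate; if the curvature is not positive, the fit suggests the best vocabulary size lies at the edge of the tested range rather than in its interior.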

2605.14340 2026-05-15 cs.SD Updated version

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

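Affiliations Kyoto University, Japan; LY Corporation, Japan

AI summary LLM-based automatic speech recognition systems perform well by connecting an audio encoder to an LLM, but their adaptation to new domains is limited by the scarcity of paired speech and text data. This paper proposes a framework that explicitly models speech-text alignment to generate more expressive pseudo-audio prompts, bridging the modality gap and improving target-domain adaptation. Experiments show that the method outperforms existing text-only adaptation methods in both overall error rate and out-of-vocabulary coverage.

Comments Submitted to Interspeech 2026

English abstract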

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features, ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.

2605.14231 2026-05-15 cs.LG cs.AI cs.SD Updated version

AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani

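Affiliations School of Computing and Information Systems, The University of Melbourne, Australia; Baskin School of Engineering, University of California, Santa Cruz, USA; Institute of Trustworthy Embodied AI, Fudan University, China

AI summary This paper proposes AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. The method builds positive pairs through structured time-frequency masking, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, AudioMosaic learns more discriminative utterance-level representations, transfers well across datasets, domains, and acoustic conditions, and achieves state-of-the-art results on several standard audio benchmarks.

Comments ICML 2026

English abstract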

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available at https://github.com/HanxunH/AudioMosaic.
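To make the idea of masking-based positive pairs concrete, here is a small sketch that draws two differently masked views of the same spectrogram patch grid. It is an assumption-laden illustration, not the released AudioMosaic code: the abstract does not specify the masking structure, so dropping whole frequency rows and time columns, along with every name and grid size below, is hypothetical.

```python
import random


def structured_mask(n_freq, n_time, n_freq_drop, n_time_drop, rng):
    """Drop whole frequency rows and time columns of a patch grid.

    Returns the (freq, time) indices of the patches that are kept, i.e. the
    only patches an encoder would need to process for this view.
    """
    drop_f = set(rng.sample(range(n_freq), n_freq_drop))
    drop_t = set(rng.sample(range(n_time), n_time_drop))
    return [
        (f, t)
        for f in range(n_freq)
        for t in range(n_time)
        if f not in drop_f and t not in drop_t
    ]


def positive_pair(n_freq, n_time, n_freq_drop, n_time_drop, seed=0):
    """Two independently masked views of the same clip form a positive pair."""
    rng = random.Random(seed)
    view_a = structured_mask(n_freq, n_time, n_freq_drop, n_time_drop, rng)
    view_b = structured_mask(n_freq, n_time, n_freq_drop, n_time_drop, rng)
    return view_a, view_b


# Illustrative 8 x 16 patch grid; each view keeps (8 - 2) * (16 - 8) patches.
view_a, view_b = positive_pair(8, 16, n_freq_drop=2, n_time_drop=8)
print(len(view_a), len(view_b))
```

Because each view keeps only a fraction of the patches, the encoder sees shorter token sequences, which is one plausible reading of how masking lowers memory use and makes large contrastive batches affordable.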

2605.12534 2026-05-15 cs.SD cs.LG q-bio.NC Updated version

BioSEN: A Bio-acoustic Signal Enhancement Network for Animal Vocalizations

Tianyu Song, Ton Viet Ta, Ngamta Thamwattana, Hisako Nomura, Linh Thi Hoai Nguyen

Affiliations Graduate School of Bioresource and Bioenvironmental Science, Kyushu University; Faculty of Agriculture, Kyushu University; School of Information and Physical Sciences, University of Newcastle; International Institute for Carbon-Neutral Energy Research, Kyushu University

AI summary This paper proposes BioSEN, a bioacoustic signal enhancement network for enhancing animal vocalizations in noisy recordings. The model adapts speech enhancement methods and adds three core modules tailored to animal sounds, for time-frequency feature extraction, harmonic structure capture, and energy-adaptive gating connections. On three bioacoustic datasets, BioSEN matches or exceeds state-of-the-art speech enhancement models with far less computation, showing promise for biodiversity monitoring and conservation.

Journal ref ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

English abstract

Most work in audio enhancement targets human speech, while bioacoustics is less studied due to noisy recordings and the distinct traits of animal sounds. To fill this gap, we adapt speech enhancement methods and build BioSEN, a model made for bioacoustic signals. BioSEN has three modules: a multi-scale dual-axis attention unit for time-frequency feature extraction, a bio-harmonic multi-scale enhancement unit for capturing harmonic structures, and an energy-adaptive gating connection unit that uses frequency weights to keep vocalizations from being removed as noise. Tests on three bioacoustic datasets show that BioSEN matches or exceeds state-of-the-art speech enhancement models while using far less computation. These results show BioSEN's strength for bioacoustic audio enhancement and its promise for biodiversity monitoring and conservation.

2603.29097 2026-05-15 eess.AS cs.SD Updated version

Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

Ui-Hyeop Shin, Hyung-Min Park

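Affiliations Department of Electronic Engineering, Sogang University

AI summary This paper studies how to separate overlapping speech in realistic acoustic environments and proposes SR-CorrNet, an asymmetric encoder-decoder framework based on time-frequency correlation. The method introduces a separation-reconstruction strategy into a time-frequency dual-path backbone to refine speaker-discriminative features stage by stage, and it uses a structured correlation-to-filter formulation to improve separation. Experiments show consistent improvements across several datasets and acoustic conditions.

Comments Submitted to IEEE Transactions on Audio, Speech, and Language Processing (TASLPRO). Code: https://github.com/dmlguq456/SR_CorrNet

English abstract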

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-{2,3,4,5}Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.

2603.11042 2026-05-15 cs.CV cs.AI cs.LG cs.MM cs.SD Updated version

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

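Affiliations UNC Chapel Hill; Adobe Research

AI summary This paper proposes V2M-Zero, a video-to-music generation method that produces music time-aligned with video events without any paired video-music data. It computes event curves separately for music and video, capturing temporal structure within each modality, to achieve cross-modal synchronization. Experiments show that V2M-Zero outperforms prior methods on several benchmarks, especially in temporal synchronization and semantic alignment, and enables independent control of timing and music style.

Comments Project page: https://genjib.github.io/v2m_zero/

English abstract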

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
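One simple way to picture an event curve built from intra-modal similarity is to measure how much consecutive embeddings differ. The sketch below is a guess at the spirit of the idea, not the paper's definition: using one minus cosine similarity between neighbouring frames, the function names, and the toy embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def event_curve(embeddings):
    """Amount of change between consecutive embeddings (1 - cosine similarity).

    The curve records when and how much the signal changes, not what changes,
    so curves from different modalities can be compared directly.
    """
    return [
        1.0 - cosine(embeddings[i - 1], embeddings[i])
        for i in range(1, len(embeddings))
    ]


# Toy per-frame embeddings (invented): a single abrupt change in the middle.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(event_curve(toy_frames))
```

Since the same function can be applied to outputs of a music encoder or a video encoder, a model conditioned on music-derived curves during training could, in principle, be driven by video-derived curves at inference, which is the substitution the abstract describes.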

2602.03680 2026-05-15 physics.soc-ph cs.SD Updated version

Instantaneous Spectra Analysis of Pulse Series -- Application to Lung Sounds with Abnormalities

Fumihiko Ishiyama

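Affiliations NTT Inc.

AI summary This paper studies instantaneous spectra analysis of pulse series and applies it to abnormal lung sounds (crackles and wheezing) and to a normal lung sound. The time-frequency resolution of conventional Fourier analysis is limited by the periodic boundary condition assumption; the author replaces this assumption with a linear extrapolation condition, which makes instantaneous spectra analysis possible. The method extracts the spectrum of each pulse in a series and assembles a spectrogram of the pulse series that visualizes its time-frequency structure, offering a new analysis tool for abnormal lung sounds.

Comments 10 pages, 7 figures. To appear in Proc. IEEE CSPA 2026

English abstract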

The "theoretical limit of time-frequency resolution of Fourier analysis" originates from its numerical implementation, especially from the assumption of a "Periodic Boundary Condition (PBC)," which was introduced a century ago. We previously proposed replacing this condition with a "Linear eXtrapolation Condition (LXC)," which does not require periodicity. This feature makes instantaneous spectra analysis of pulse series possible, replacing the short-time Fourier transform (STFT). As a demonstration, we applied the instantaneous spectra analysis to two lung sounds with abnormalities (crackles and wheezing) and to a normal lung sound. Among them, crackles contain a random pulse series. The spectrum of each pulse is available, and the spectrogram of the pulse series is obtained by assembling the individual spectra. As a result, the time-frequency structure of a given pulse series is visualized.

2512.03637 2026-05-15 cs.SD cs.LG stat.ML Updated version

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Kohei Yamamoto, Kosuke Okusa

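Affiliations Research & Development Center, Technology Division, Oki Electric Industry Co., Ltd.; Department of Data Science for Business Innovation, Chuo University

AI summary This study proposes AaSP, a self-supervised pre-training framework for audio spectrogram transformers that addresses the aliasing caused by temporal downsampling in conventional approaches. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn audio representations that integrate features from alias-prone bands and remain stable across masked views. Experiments show state-of-the-art results on several audio recognition benchmarks among the compared self-supervised baselines, with competitive results elsewhere.

Comments Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Copyright IEEE

English abstract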

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
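The claim that temporal downsampling lowers the effective Nyquist frequency can be checked with a little arithmetic. The sketch below is illustrative only: the 100 Hz frame rate, the stride of 16 frames per patch, and the 5 Hz modulation are assumed values chosen for the example, and the helper names are hypothetical.

```python
def nyquist_after_downsampling(frame_rate_hz, stride):
    """Nyquist frequency of a sequence after keeping every `stride`-th frame."""
    return frame_rate_hz / stride / 2.0


def aliased_frequency(f_hz, sample_rate_hz):
    """Apparent frequency of a real sinusoid of f_hz sampled at sample_rate_hz."""
    k = round(f_hz / sample_rate_hz)  # nearest multiple of the sampling rate
    return abs(f_hz - k * sample_rate_hz)


# Assumed setup: spectrogram frames every 10 ms, patches spanning 16 frames.
frame_rate = 100.0
stride = 16
patch_rate = frame_rate / stride  # 6.25 patches per second

print(nyquist_after_downsampling(frame_rate, stride))  # modulation band limit
print(aliased_frequency(5.0, patch_rate))  # where a 5 Hz modulation lands
```

Under these assumed numbers, any temporal modulation faster than the post-stride Nyquist limit folds back to a lower apparent frequency, which is the kind of alias-prone content the abstract says AaPE analyzes separately instead of discarding it with a low-pass filter.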

2511.21247 2026-05-15 eess.AS cs.LG cs.SD Updated version

The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval

Jaime Garcia-Martinez, David Diaz-Guerra, John Anderson, Ricardo Falcon-Perez, Pablo Cabañas-Molero, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

Affiliations * Universidad de Jaén; Odratek BV; Tampere University

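AI summary This paper introduces The Spheres dataset, a collection of multitrack orchestral recordings intended to advance machine learning research on music source separation and related music information retrieval tasks in classical music. The dataset contains over one hour of music performed by the Colibrì Ensemble at The Spheres recording studio, including Tchaikovsky's Romeo and Juliet and Mozart's Symphony No. 40, together with chromatic scales and solo excerpts for each instrument. Recorded with 23 microphones (close spot, main, and ambient), it provides realistic stereo mixes with controlled bleeding and isolated stems for training and evaluating source separation models, along with room impulse responses estimated at each instrument position to characterize the acoustics of the recording space.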

Journal ref in IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 2622-2634, 2026

English abstract

This paper introduces The Spheres dataset, multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset is composed of over one hour of recordings of musical pieces performed by the Colibrì Ensemble at The Spheres recording studio, capturing two canonical works (Tchaikovsky's Romeo and Juliet and Mozart's Symphony No. 40) along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleeding and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering valuable acoustic characterization of the recording space. We present the dataset structure, acoustic analysis, and baseline evaluations using X-UMX based models for orchestral family separation and microphone debleeding. Results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset's value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.

2605.14066 2026-05-15 eess.AS cs.AI cs.CL cs.SD Updated version

A Benchmark for Early-stage Parkinson's Disease Detection from Speech

Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong, Janna Maas, Louis ten Bosch, Bastiaan R. Bloem

Affiliations * Centre for Language Studies, Radboud University, Nijmegen, the Netherlands; Center of Expertise for Parkinson and Movement Disorders, Radboud University Medical Center, Nijmegen, the Netherlands

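AI summary This study proposes the first benchmark for speech-based detection of early-stage Parkinson's disease, addressing the difficulty of comparing published results that differ in datasets, languages, tasks, and evaluation protocols. The benchmark uses a speaker-independent split, supports fair and reproducible cross-method evaluation on researcher-accessible datasets, covers three common speech tasks, and tests methods under different training-resource settings. It also reports multi-dimensional evaluation breakdowns to support fine-grained comparison and clinical adoption, providing a reusable reference for robust and clinically meaningful early-stage detection.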

Comments Submitted to Interspeech 2026

English abstract

Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
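
A speaker-independent split of the kind mentioned in the abstract assigns every recording of a given speaker to the same partition, so a model is never tested on a voice it has already heard. The following is a minimal sketch of that idea only; the function name, the `spkNN` identifiers, the file names, and the 20% test fraction are hypothetical, and the benchmark's actual partitioning procedure (for example, any stratification by diagnosis or dataset) is not described here.

```python
import random
from collections import defaultdict

def speaker_independent_split(recordings, test_fraction=0.2, seed=0):
    """Split (speaker_id, recording) pairs so no speaker appears in both partitions."""
    by_speaker = defaultdict(list)
    for speaker_id, item in recordings:
        by_speaker[speaker_id].append(item)

    # Shuffle speakers, not recordings, so whole speakers move together.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(test_fraction * len(speakers)))

    test = [(s, r) for s in speakers[:n_test] for r in by_speaker[s]]
    train = [(s, r) for s in speakers[n_test:] for r in by_speaker[s]]
    return train, test

# Hypothetical toy data: 10 speakers, 3 speech tasks each.
recordings = [(f"spk{i:02d}", f"spk{i:02d}_task{t}.wav") for i in range(10) for t in range(3)]
train, test = speaker_independent_split(recordings)

train_speakers = {s for s, _ in train}
test_speakers = {s for s, _ in test}
print(len(train), len(test), train_speakers.isdisjoint(test_speakers))
```

Splitting at the speaker level rather than the recording level avoids the leakage that occurs when different tasks from the same person land on both sides of the split.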

2605.14031 2026-05-15 cs.SD cs.CV cs.LG Updated version

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

Affiliations * University of Massachusetts Amherst

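AI summary This paper studies how effective masked autoencoders (MAE) are for fine-grained species classification in bioacoustics when data is limited. Through systematic experiments on the iNatSounds dataset, it analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategy. Models pretrained on diverse general audio data transfer best to the bioacoustic task, whereas additional domain-specific pretraining and data filtering bring limited benefit at small data scale and can even degrade performance. The results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining data scale matters more for model performance than objective design.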

Comments Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

English abstract

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

2605.14016 2026-05-15 cs.SE cs.SD Updated version

Case Studies and Reflections on Agentic Software Engineering for Rapid Development of Digital Music Instruments

Matthew John Yee-King

Affiliations * Computing, Goldsmiths

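AI summary This article explores the use of agentic software engineering (ASE) for developing digital music instrument software, with the aim of lowering barriers to entry and improving interoperability and longevity. Through three case studies, the author shows how ASE was used to develop audio software in C++ with the JUCE framework: re-implementing Music Mouse as a native plugin, porting the Continuator system from Python to a native plugin, and building a new 3D user interface for an existing tracker sequencer. Drawing on an account of the developer's own experience, the study identifies effective ASE practice in this domain and suggests future evaluation of the method with non-programmer musicians.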

English abstract

The article explores the use of agentic software engineering (ASE) in the development of innovative audio software. It begins with a review of background work that lays out the challenges of longevity, interoperability and barriers to entry in digital music instrument creation, explaining recent developments in ASE and highlighting the possibility that ASE can lower barriers to entry and facilitate creation of interoperable software with greater longevity. Following that, we present case studies wherein we used ASE technology in three distinct ways to develop audio software in the C++ language with the JUCE framework. In case study 1, we re-implement Laurie Spiegel's 'Music Mouse' software as a native plugin. In case study 2, we translate Pachet's 'Continuator' system from Python into a native plugin. In case study 3, we develop a new 3D user interface for an existing 'tracker' sequencer using OpenGL. We describe the experiences of the human developer in the case studies via autoethnographic discussion of the prompt logs and snapshots of the software as it was developed. We identify effective practice for ASE use in this domain and suggest future steps for the work involving evaluation of the method with non-programmer musicians.