arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07903 2026-05-11 cs.SD cs.AI Version update

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

Hamze Hammami, Nidhal Abdulaziz

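AI summary: This paper proposes BeeVe, an unsupervised framework for discovering acoustic states in honey bee colony buzzing without relying on predefined semantic units or vocal production models. The method uses a frozen self-supervised PaSST model to extract features and trains a VQ-VAE on unlabelled data to learn a discrete acoustic codebook. The learned tokens separate queenright from queenless colonies, and the queenless condition further decomposes into three stable sub-states. Experiments show that the method captures non-random sequential structure in the acoustic signal and generalises to unseen recordings, opening a path toward non-invasive hive health monitoring.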

Abstract

Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
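
The two post-hoc measures named in the abstract, Jensen-Shannon Divergence between token distributions and Jaccard overlap between token sets, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the base-2 logarithm (which bounds the divergence to [0, 1]), and the toy token sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def token_distribution(tokens, codebook_size):
    """Relative frequency of each codebook index in a token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts.get(k, 0) / total for k in range(codebook_size)]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed base 2, so 0 <= JSD <= 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jaccard(tokens_a, tokens_b):
    """Overlap between the sets of tokens used in two recordings."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical token sequences for two hive conditions (illustrative only).
queenright = [0, 0, 1, 1]
queenless = [2, 2, 3, 3]
p = token_distribution(queenright, codebook_size=4)
q = token_distribution(queenless, codebook_size=4)
print(js_divergence(p, q))              # disjoint token usage gives the maximum
print(jaccard(queenright, queenless))   # no shared tokens
```

Under the base-2 assumption, a divergence near 0 means the two conditions use the codebook almost identically, and a value of 1 means they share no tokens at all.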

2605.07735 2026-05-11 cs.SD Version update

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

Yassin Terraf, Youssef Iraqi

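AI summary: This paper proposes TARNet, a lightweight temporal-aware multi-scale network for closed-set speaker identification. A multi-stage temporal encoder explicitly models temporal information at different time scales, and an attentive statistics pooling module fuses and aggregates the multi-scale features into a discriminative speaker embedding. Experiments show that TARNet outperforms state-of-the-art methods on VoxCeleb1 and LibriSpeech while maintaining competitive computational complexity, making it suitable for practical use.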

Comments: Accepted at IEEE International Conference on Multimedia and Expo (ICME) 2026. Code available at: https://github.com/YassinTERRAF/TARNet

Abstract

Closed-set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.
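
As a rough illustration of what attentive statistics pooling computes, the sketch below turns frame-level features into one utterance-level vector by concatenating an attention-weighted mean and standard deviation. It is not taken from the TARNet repository: the per-frame attention scores are passed in directly here, whereas in a trained model they would come from a small learned network, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attentive_stats_pooling(frames, scores, eps=1e-9):
    """Pool T frames of dimension D into a single vector of length 2 * D.

    frames: list of T feature vectors (each a list of D floats).
    scores: list of T unnormalised attention scores (assumed given).
    """
    weights = softmax(scores)
    dim = len(frames[0])
    # Attention-weighted mean per feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Attention-weighted variance: E[x^2] - (E[x])^2, clamped for stability.
    var = [
        sum(w * f[d] ** 2 for w, f in zip(weights, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, eps)) for v in var]
    return mean + std  # concatenation of mean and standard deviation


# Two frames with equal attention: the result is the plain mean and std.
print(attentive_stats_pooling([[1.0, 2.0], [3.0, 6.0]], [0.0, 0.0]))
```

With equal scores the pooling reduces to ordinary statistics pooling; unequal scores let some frames dominate the embedding.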

2605.07694 2026-05-11 eess.AS cs.AI cs.SD eess.SP Version update

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

Michael Neri, Archontis Politis, Tuomas Virtanen

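AI summary: This paper studies how single-channel speaker distance estimation models depend on the early reflections and late reverberation of the room impulse response (RIR). Simulated RIRs are decomposed into four variants and evaluated under different calibration conditions. Without time calibration, the model relies mainly on early reflections; with time calibration, it reaches high accuracy from the propagation delay alone. Accuracy also improves with stronger early energy and degrades in highly reverberant environments.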

Comments: Submitted to IWAENC 2026

Abstract

Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to $1.29$ m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, $C_{50}$, and $T_{60}$ confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves an MAE of $0.14$ m by extracting the propagation delay alone, regardless of the RIR content.
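
Two of the descriptors mentioned in the abstract, the direct-to-reverberant ratio (DRR) and the clarity index $C_{50}$, are energy ratios that can be read off an RIR. The sketch below uses their common textbook definitions and is not the authors' evaluation code: the 2.5 ms direct-path window around the strongest peak, measuring the 50 ms boundary from that peak, and the toy impulse response are all assumptions.

```python
import math


def energy(samples):
    """Sum of squared sample values."""
    return sum(x * x for x in samples)


def ratio_db(numerator, denominator):
    """Energy ratio in decibels; infinite when the denominator is empty."""
    if denominator == 0:
        return math.inf
    return 10 * math.log10(numerator / denominator)


def direct_peak(rir):
    """Index of the strongest sample, taken here as the direct sound."""
    return max(range(len(rir)), key=lambda i: abs(rir[i]))


def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio: energy near the peak over all other energy."""
    peak = direct_peak(rir)
    half = round(direct_window_ms * 1e-3 * fs)  # assumed window half-width
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = energy(rir[lo:hi])
    reverberant = energy(rir[:lo]) + energy(rir[hi:])
    return ratio_db(direct, reverberant)


def c50_db(rir, fs):
    """Clarity C50: energy in the first 50 ms after the direct sound over the rest."""
    peak = direct_peak(rir)
    split = peak + round(0.050 * fs)
    return ratio_db(energy(rir[peak:split]), energy(rir[split:]))


# Toy RIR at 1 kHz: direct sound, one early reflection, one late arrival.
rir = [0.0] * 200
rir[10], rir[30], rir[100] = 1.0, 0.5, 0.1
print(round(drr_db(rir, 1000), 2), round(c50_db(rir, 1000), 2))
```

Higher values of both ratios indicate more early energy relative to reverberation, which is the regime where the abstract reports better distance estimates.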

2408.07522 2026-05-11 cs.SD cs.LG eess.AS Version update

Optimising MFCC parameters for the automatic detection of respiratory diseases

Yuyang Yan, Sami O. Simons, Loes van Bemmel, Lauren Reinders, Frits M. E. Franssen, Visara Urovi

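AI summary: This study examines how MFCC extraction parameters affect automatic respiratory disease detection, systematically analysing the number of coefficients, frame length, and hop length. Experiments on four datasets with an SVM classifier show that accuracy decreases as hop length increases, that the optimal number of coefficients is about 30, and that sensitivity to frame length differs across datasets. Optimising the parameter combination substantially improves classification accuracy, by up to 19.6%.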

Abstract

Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCCs) are widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy obtained with MFCC features decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC features varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. With the optimized combination, the SVM model achieves accuracies of 81.1%, 80.6%, and 71.7% on the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset, respectively, improvements of 19.6%, 16.1%, and 14.9% over the worst combination.
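
To make the three parameters concrete, the sketch below converts frame and hop lengths from milliseconds to samples, counts the frames a recording would yield, and enumerates a parameter grid. It illustrates the bookkeeping only and performs no MFCC extraction. The grid values, the 16 kHz sample rate, the no-padding frame count, and the rule that the hop may not exceed the frame length are assumptions, not settings reported in the paper.

```python
from itertools import product


def frame_params(sample_rate, frame_ms, hop_ms):
    """Convert frame and hop lengths from milliseconds to samples."""
    frame_len = round(sample_rate * frame_ms / 1000)
    hop_len = round(sample_rate * hop_ms / 1000)
    return frame_len, hop_len


def num_frames(num_samples, frame_len, hop_len):
    """Number of full frames without padding (assumed framing convention)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


def parameter_grid(n_coeffs, frame_lengths_ms, hop_lengths_ms):
    """All (coefficients, frame, hop) combinations with hop <= frame."""
    return [
        (n, f, h)
        for n, f, h in product(n_coeffs, frame_lengths_ms, hop_lengths_ms)
        if h <= f
    ]


# Hypothetical grid: every value below is illustrative only.
sample_rate = 16000
for n, f, h in parameter_grid([13, 30], [50, 500], [10, 100]):
    frame_len, hop_len = frame_params(sample_rate, f, h)
    frames = num_frames(sample_rate * 1, frame_len, hop_len)  # 1 s of audio
    print(n, f, h, frame_len, hop_len, frames)
```

A longer hop yields fewer frames per recording, which is one plausible reason the reported accuracy falls as hop length grows.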

2605.07489 2026-05-11 cs.SD cs.MM eess.SP Version update

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

Qiqi He, Dichucheng Li, Xiaoheng Sun, Anqi Huang

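AI summary: This paper proposes a decomposed Retrieval-Edit-Rerank (RER) framework for chord generation, addressing the challenge of balancing stylistic diversity with music-theoretic feasibility. Generation is split into three explicit stages: retrieving candidate chords, editing them to ensure music-theoretic feasibility, and reranking feasible candidates by soft preferences. This structured pipeline improves controllability and interpretability, and it outperforms existing end-to-end methods in both objective metrics and subjective evaluation.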

Comments: Accepted by the 2026 ACM International Conference on Multimedia Retrieval (ICMR 2026)

Abstract

Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.

2605.06897 2026-05-11 cs.CL cs.AI cs.HC cs.MM cs.SD eess.AS Version update

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo

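AI summary: As Internet of Things (IoT) devices spread, voice interfaces must handle complex user interactions. This paper presents MIST, a speech-based multimodal tool-calling dataset framed as a code generation task in smart-home settings, targeting device state tracking, spatiotemporal constraints, and mixed-initiative interaction. The study finds a clear gap between open-weight and closed-weight LLMs on MIST, and even frontier closed-weight models leave substantial headroom. MIST is released together with its data generation framework as a resource for related research.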

Comments: Project page: https://billyzhang24kobe.github.io/mist-smarthome/

Abstract

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

2605.06685 2026-05-11 cs.SD eess.AS stat.AP Version update

An audio-to-analysis pipeline with certified transcription for information-theoretic profiling of the piano repertoire

Fred Jalbert-Desforges

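AI summary: This paper presents a pipeline that produces composer-level information-theoretic profiles directly from audio. A certified transcription layer (F1 = 0.9791 on the MAESTRO test set) feeds the extraction of distributions over harmonic scale degrees, which are analysed with Shannon entropy, asymmetric KL divergence, and Zipfian modelling. The results order composers along an interpretable axis of harmonic predictability, recover known stylistic lineages, and separate contemporary neoclassical artists from historical composers by how well their harmonic transition distributions fit a Zipfian model.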

Comments: 25 pages, 4 figures, 25 references

Abstract

We present an audio-to-analysis pipeline that produces composer-level information-theoretic profiles (reflecting compositional vocabulary as it emerges from aggregated performances) from raw recordings, built on a transcription layer whose accuracy we certify on a standard benchmark (F1 = 0.9791 on the MAESTRO v3.0.0 test set). Applied to 1,238 pieces and 15 MAESTRO composers with at least ten attributed pieces, spanning the Baroque through the early twentieth century, the pipeline derives empirical distributions over harmonic scale degrees and analyzes them through Shannon entropy, asymmetric Kullback-Leibler divergence, and Zipfian rank-frequency modeling. The resulting profiles (i) order composers along an interpretable axis of harmonic predictability, with a narrow entropy range (3.33-3.86 bits) that reveals the marginal-level similarity of tonal vocabularies; (ii) recover known stylistic lineages (Haydn-Beethoven, Liszt-Rachmaninoff, Schubert-Schumann) through the smallest KL divergences in the corpus, with Mendelssohn emerging as a stable outlier within this corpus; and (iii) separate contemporary neoclassical artists (Richter, Frahm, Glass, Arnalds, Jóhannsson) from historical composers on the quality of Zipfian fit to the transition distribution, with mean $R^2 = 0.78$ for neoclassical versus 0.46 for historical ($N \geq 10$ pieces each). This gap is larger than the spread within either group and is consistent with a minimalist compositional tendency: a compact transition vocabulary used with sharper frequency-rank regularity than historical composers. All estimates are reported with Laplace-smoothed bootstrap 95% confidence intervals.
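
The entropy and divergence measures named in the abstract can be sketched on toy count vectors as follows. This is an illustration rather than the author's pipeline: the four-symbol alphabet, the counts, and the add-one smoothing constant are assumptions, and the real scale-degree alphabet is not specified in the abstract.

```python
import math


def laplace_smooth(counts, alpha=1.0):
    """Add-alpha smoothing, so every symbol gets non-zero probability."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def shannon_entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_bits(p, q):
    """Asymmetric Kullback-Leibler divergence KL(p || q) in bits.

    Requires q_i > 0 wherever p_i > 0, which smoothing guarantees.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scale-degree counts for two composers (illustrative only).
composer_a = laplace_smooth([8, 1, 1, 2])
composer_b = laplace_smooth([3, 3, 3, 3])

print(shannon_entropy_bits(composer_a), shannon_entropy_bits(composer_b))
print(kl_bits(composer_a, composer_b), kl_bits(composer_b, composer_a))
```

The two KL values differ, which is why the direction of comparison matters when pairs of composers are ranked by divergence.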

2605.05927 2026-05-11 cs.CL cs.SD eess.AS Version update

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

Wenqian Cui, Xiao-Hui Li, Daxin Tan, Qiyong Zheng, Irwin King

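AI summary: This paper studies the modality gap between speech large language models (SLMs) and text large language models (TLMs) and proposes reducing it from the input side. The authors design TextPro-SLM, which combines the unified speech encoder WhisperPro with a trained LLM backbone so that spoken input more closely resembles the input of a prosody-aware text LLM. Experiments show that TextPro-SLM achieves the lowest modality gap at both the 3B and 7B scales and performs strongly on paralinguistic understanding tasks, using only about 1,000 hours of LLM training audio.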

Comments: Work in progress

Abstract

Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.

2503.05085 2026-05-11 cs.CL cs.SD eess.AS Version update

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

Feng Jiang, Zhiyu Lin, Yiyang Liu, Liumeng Xue, Fan Bu, Yuhao Du, Xiangying Chen, Benyou Wang, Haizhou Li

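AI summary: This paper presents S2S-Arena, a benchmark for evaluating how well instruction-following speech-to-speech models understand and express paralinguistic information such as prosody, emotion, and speaker traits. It uses a four-level interaction protocol and a two-stage data construction pipeline to produce 1,243 speech samples covering more than 100 real-world tasks, and introduces a reference-free pairwise comparison framework. Experiments reveal substantial performance gaps between current academic and industrial systems under complex paralinguistic demands. The analysis further identifies key design factors governing expressive instruction following, offering guidance for building more natural, robust, and human-aligned speech agents.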

Comments: Accepted by ACL 2026 main

Abstract

Recent advances in large language models (LLMs) have fundamentally reshaped speech-to-speech (S2S) systems, enabling increasingly natural spoken interaction. However, existing benchmarks still rely heavily on text-based evaluation and largely ignore paralinguistic cues such as prosody, emotion, and speaker traits, which are central to expressive and human-like communication. We introduce S2S-Arena, a speech-native benchmark for evaluating instruction-following S2S models with explicit assessment of both semantic understanding and paralinguistic expression. S2S-Arena features a four-level interaction protocol that systematically probes models under increasing paralinguistic complexity, a two-stage data construction pipeline that produces 1,243 speech samples spanning 100+ real-world tasks, and an arena-style evaluation framework that enables reference-free, pairwise comparison directly in the speech modality. Benchmarking 10 state-of-the-art S2S systems over 1,000+ comparisons reveals substantial performance gaps (especially under complex paralinguistic demands) between current academic and industrial systems. Our analysis further identifies key design factors governing expressive instruction following, providing actionable insights for building more natural, robust, and human-aligned speech agents.