arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.12387 2026-05-13 cs.SD cs.LG updated version

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

Adam Wynn, Jingyun Wang

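Affiliations: Department of Computer Science

AI summary: This paper proposes a semi-supervised framework for speech confidence detection using the Whisper model, addressing the challenges posed by limited labelled data and the subjectivity of paralinguistic annotations. The framework fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and probability estimates of vocal stress and disfluency, and it introduces an uncertainty-aware pseudo-labelling strategy to reduce reliance on labelled data. Experiments show that the method achieves a Macro-F1 of 0.751, outperforming several self-supervised baselines, and improves the minority class by 3% over the unimodal Whisper baseline, confirming the value of explicit prosodic and auxiliary features for confidence detection.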

Comments 12 pages, 9 Figures, Submitted to IEEE Transactions on Audio, Speech and Language Processing

Abstract

Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high-confidence pseudo-labels outperforms indiscriminate large-scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
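
The pseudo-labelling step described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that uncertainty is measured by the top predicted class probability and that a fixed threshold decides which unlabelled samples are retained; the abstract does not state the paper's actual criterion, and the name `select_pseudo_labels` and the 0.9 threshold are hypothetical.

```python
import numpy as np


def select_pseudo_labels(probs, threshold=0.9):
    """Keep only samples whose top predicted class probability meets the threshold.

    probs: array of shape (n_samples, n_classes) holding predicted class
           probabilities for unlabelled data (assumed model output).
    Returns (indices, labels) for the retained samples.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)   # top-class probability per sample
    labels = probs.argmax(axis=1)    # predicted class per sample
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]


# Toy predictions for four unlabelled utterances and two classes.
probs = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.08, 0.92],
    [0.40, 0.60],
])
kept_indices, kept_labels = select_pseudo_labels(probs, threshold=0.9)
```

Raising the threshold keeps fewer but cleaner pseudo-labels, which is the quality-over-quantity trade-off the ablation in the abstract refers to.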

2605.12310 2026-05-13 cs.SD updated version

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

Chen Geng, Meng Chen, Ruohua Zhou, Ruolan Liu, Weifeng Zhao

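Affiliations: School of Intelligence Science and Technology; Beijing University of Civil Engineering and Architecture; Lyra Lab, Tencent Music Entertainment; Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture

AI summary: This paper proposes Poly-SVC, a polyphony-aware singing voice conversion system that converts a source singer's voice into a target singer's voice while preserving lyrics and melody. The method handles the residual harmonies left in vocals extracted from accompanied recordings: a Constant-Q Transform (CQT)-based pitch extractor, a random sampler, and a diffusion decoder based on Conditional Flow Matching fuse melody and harmony features into natural, expressive polyphonic output. Experiments show that Poly-SVC outperforms existing baselines in naturalness, timbre similarity, and harmony reconstruction.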

Comments Accepted by ICASSP 2026

Abstract

Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we innovatively propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.

2605.12287 2026-05-13 eess.AS cs.SD updated version

The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

Jaehoon Ahn, Tae Gum Hwang, Moon-Ryul Jung

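Affiliations: Sogang University

AI summary: In recent years, beat tracking models based on deep neural networks have performed well on mainstream percussive datasets but have consistently scored poorly on the SMC dataset. This paper analyses the failure modes of current state-of-the-art models on SMC, finding that the main problems are octave errors, continuity errors, and complete tracking failure, and that these models tend to produce "confident-but-wrong" activations. The study also shows that the standard DBN's default minimum tempo of 55 BPM prevents it from inferring the correct tempo for 21% of SMC tracks, which degrades overall performance, and it offers concrete directions for improving beat and downbeat detection.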

Comments 6 pages, 3 figures. Technical report on beat tracking failure modes; prepared for ISMIR 2026

Abstract

Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNN). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the SMC dataset has stubbornly yielded low F-measure scores. By testing how well state-of-the-art models detect beats on individual tracks in the SMC dataset, we identify three distinct failure modes: octave errors, continuity errors, and complete tracking failure where all metrics fall below 0.3. We reveal that state-of-the-art models tend to generate "confident-but-wrong" activations. Furthermore, we show that the standard DBN's default minimum tempo of 55 BPM prevents it from inferring the correct tempo for 21% of SMC tracks, forcing double-tempo predictions on slow music. By exposing such fundamental oversights, we provide concrete directions for improving beat and downbeat detection, specifically emphasizing training data diversification and multi-hypothesis tempo estimation.
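
The double-tempo effect mentioned in this abstract can be illustrated with a toy sketch. This is not the DBN inference itself: it only models a tracker that may report tempi inside a fixed range and therefore lands on an octave-related tempo when the true tempo is too slow. The 55 BPM lower bound comes from the abstract; the 215 BPM upper bound and the function name `fold_tempo_into_range` are assumptions made for the example.

```python
def fold_tempo_into_range(true_bpm, min_bpm=55.0, max_bpm=215.0):
    """Return the octave-related tempo that fits inside [min_bpm, max_bpm].

    Tempi below min_bpm are doubled and tempi above max_bpm are halved,
    mimicking the octave error a range-limited tracker is forced into.
    """
    if true_bpm <= 0:
        raise ValueError("tempo must be positive")
    bpm = float(true_bpm)
    while bpm < min_bpm:
        bpm *= 2.0   # too slow for the tracker: reported at double tempo
    while bpm > max_bpm:
        bpm /= 2.0   # too fast for the tracker: reported at half tempo
    return bpm


slow_track = fold_tempo_into_range(40.0)     # below the 55 BPM floor
normal_track = fold_tempo_into_range(120.0)  # already inside the range
```

A 40 BPM piece cannot be represented under a 55 BPM floor, so the only consistent hypothesis left is 80 BPM, which is the octave error the paper attributes to the default setting.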

2605.12135 2026-05-13 cs.SD cs.LG eess.AS updated version

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

Joshua Opria

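Affiliations: Independent Researcher

AI summary: This paper presents STRUM, an end-to-end system that converts raw audio into playable rhythm-game charts (for Clone Hero and YARG) without any manually annotated metadata, covering drums, guitar, bass, vocals, and keys. STRUM takes a multi-stage hybrid approach: a convolutional recurrent neural network (CRNN) for drum onset detection, neural onset detectors with monophonic pitch tracking for guitar and bass, word-aligned speech recognition for vocals, and spectral analysis for detecting keyboard notes. Experiments on a 30-song dataset screened by audio quality report onset F1 scores ranging from 0.539 (vocals) to 0.838 (drums), along with a full ablation of the drum-pipeline components.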

Comments 9 pages, 4 figures, 3 tables. Code and models: https://github.com/<your-github-username>/autocharter

Abstract

We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
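
The evaluation protocol named in this abstract (onset F1 at a +/- 100 ms tolerance with a per-song global offset search) can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the greedy one-to-one matching, the candidate offset grid, and the function names are all assumptions.

```python
def onset_f1(reference, estimated, tolerance=0.1):
    """F1 between two onset lists (seconds) using greedy one-to-one matching."""
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            matched += 1      # pair the two onsets and consume both
            i += 1
            j += 1
        elif diff < 0:
            j += 1            # estimate is too early to match this reference
        else:
            i += 1            # reference has no estimate close enough
    if matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_global_offset(reference, estimated, candidate_offsets, tolerance=0.1):
    """Try each constant time shift on the estimates and keep the best F1."""
    scored = [
        (onset_f1(reference, [t + offset for t in estimated], tolerance), offset)
        for offset in candidate_offsets
    ]
    best_f1, best_offset = max(scored, key=lambda pair: pair[0])
    return best_offset, best_f1


# Toy chart whose notes are all 250 ms late relative to the reference.
reference = [1.0, 2.0, 3.0, 4.0]
estimated = [1.25, 2.25, 3.25, 4.25]
raw_f1 = onset_f1(reference, estimated)
offset, shifted_f1 = best_global_offset(
    reference, estimated, [-0.5, -0.25, 0.0, 0.25, 0.5]
)
```

The offset search separates a constant chart-to-audio timing shift from genuine detection errors, which matters when the ground-truth charts are community-made and not sample-accurate.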

2601.09448 2026-05-13 cs.SD cs.AI updated version

One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization

Ioannis Stylianou, Jon Francombe, Pablo Martinez-Nuevo, Sven Ewan Shepstone, Zheng-Hua Tan

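Affiliations: Bang & Olufsen A/S, Struer, Denmark; Department of Electronic Systems, Aalborg University; Pioneer Centre for AI, Copenhagen, Denmark

AI summary: This paper proposes a large language model (LLM)-based approach to audio equalization that maps natural language prompts to equalization settings, enabling conversational control of sound systems. Using data collected in a controlled listening experiment, and combining in-context learning with parameter-efficient fine-tuning, the models align reliably with population-preferred equalization settings. Results show statistically significant improvements in distributional alignment over random sampling and static preset baselines, indicating the potential of LLMs as "artificial equalizers" and pointing toward more accessible, context-aware, expert-level audio tuning.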

Comments 13 pages, 15 figures, 2 tables, IEEE JSTSP submission

Abstract

Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users' varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as "artificial equalizers," contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.

2511.10670 2026-05-13 cs.CL cs.AI cs.SD updated version

Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Derek F. Wong, Jinsong Su

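Affiliations: School of Informatics, Xiamen University, China; Huawei Translation Services Center, Beijing, China; NLP²CT Lab, Department of Computer and Information Science, University of Macau

AI summary: This work addresses fine-grained semantic modelling in code-switching speech translation. It proposes a speech projector with a Mixture-of-Experts (MoE) structure in which language expert groups model the semantic space of each language at a fine granularity. A language-specific loss and an intra-group load balancing loss guide efficient token routing, and a multi-stage training strategy that uses existing automatic speech recognition and monolingual speech translation data strengthens alignment and translation quality. Experiments on multiple datasets show average gains of 0.86 BLEU and 0.93 COMET over SeamlessM4T.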

Comments Accepted to IJCAI 2026 Main Track

Abstract

Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into a target language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of 0.86 BLEU and 0.93 COMET over SeamlessM4T, with maximum improvements of 1.49 BLEU and 1.41 COMET across different test sets.

2509.13548 2026-05-13 cs.SD eess.AS stat.ML updated version

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

Manan Mittal, Thomas Deppisch, Joseph Forrer, Chris Le Sueur, Zamir Ben-Hur, David Lou Alon, Daniel D. E. Wong

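Affiliations: Stony Brook University; Chalmers University of Technology; Reality Labs Research, Meta

AI summary: This paper proposes a new method based on a mixture-of-experts framework for field-of-view enhanced binaural rendering of moving talkers. The method combines multiple binaural filters online using implicit localization, enabling real-time tracking and enhancement of continuously moving sources while emphasizing or suppressing sounds from selected directions and preserving natural binaural cues. Unlike traditional methods that rely on direction-of-arrival estimation or operate in the Ambisonics domain, this signal-dependent framework is agnostic to array geometry and suits spatial audio capture and personalized playback in next-generation consumer audio devices.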

Comments 5 pages, 3 figures

Abstract

We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.

2412.13050 2026-05-13 cs.LG cs.AI cs.CL cs.CV cs.SD eess.AS updated version

Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Weiguo Pian, Shijian Deng, Shentong Mo, Mingrui Liu, Yunhui Guo, Yapeng Tian

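Affiliations: The University of Texas at Dallas; Carnegie Mellon University; George Mason University

AI summary: This paper introduces Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for multimodal large language models in which tasks involve inconsistent modalities (image, audio, or video) and different task types (captioning or question answering). To counter the catastrophic forgetting caused by shifts in modality and task type, the authors propose MoInCL, which uses a Pseudo Targets Generation Module and Instruction-based Knowledge Distillation to limit the effect of these shifts on model performance. Experiments show that MoInCL outperforms existing continual learning methods across multiple tasks.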

Comments Accepted at Transactions on Machine Learning Research (TMLR), 2026

Abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.

2605.11286 2026-05-13 eess.SP cs.SD eess.AS updated version

Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming

Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer

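Affiliations: Stony Brook University; University of Illinois Chicago; University of Massachusetts Dartmouth

AI summary: This paper addresses the shortage of data snapshots that large microphone arrays face when performing adaptive beamforming in dynamic acoustic environments, and proposes an adaptive diagonal loading method based on Krylov subspaces. The method uses Lanczos iterations to build a small Krylov subspace and projects the spatial correlation matrix onto a low-dimensional tridiagonal matrix, so that the extreme eigenvalues can be estimated efficiently at much lower computational cost. Experiments show that the method maintains beamforming performance and strict white noise gain constraints at a small fraction of the cost of conventional eigenvalue decomposition.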

Comments 5 pages, 8 figures

Abstract

Reliable adaptive beamforming is critical for large microphone arrays operating in highly dynamic acoustic environments. In scenarios characterized by fast-moving talkers and interferers, the available sample support for estimating the spatial correlation matrix is often snapshot-deficient. This deficiency degrades the White Noise Gain (WNG), leading to severe target signal cancellation. To ensure stable and robust beamforming, we previously proposed an adaptive diagonal loading method that leverages the Kantorovich inequality to guarantee the WNG remains strictly within specified bounds. However, accurately determining the smallest necessary loading level requires calculating the extreme eigenvalues of the spatial correlation matrix, a computationally expensive $\mathcal{O}(M^3)$ operation for large arrays. In this paper, we introduce a highly efficient $\mathcal{O}(kM^2)$ estimation technique using Lanczos iterations to build a small Krylov subspace. By projecting the correlation matrix onto a tridiagonal matrix of dimension $k \ll M$, we extract Ritz values that rapidly converge to the exact extreme eigenvalues. Our evaluations demonstrate that this Lanczos-accelerated approach achieves performance identical to exact Eigenvalue Decomposition (EVD), ensuring optimal interference suppression and strict WNG adherence at a fraction of the computational cost.
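
The eigenvalue estimation step described in this abstract can be sketched with a textbook Lanczos iteration. This is an illustration under simplifying assumptions, not the authors' implementation: the matrix is taken to be real symmetric (a spatial correlation matrix would normally be complex Hermitian), full reorthogonalization is added for numerical stability at an extra O(k^2 M) cost, and the function name `lanczos_extreme_eigenvalues` is hypothetical.

```python
import numpy as np


def lanczos_extreme_eigenvalues(R, k, seed=0):
    """Estimate the smallest and largest eigenvalues of symmetric R.

    Runs at most k Lanczos steps, builds the small tridiagonal matrix T,
    and returns the extreme Ritz values (eigenvalues of T).
    """
    M = R.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(M)
    q /= np.linalg.norm(q)           # random unit start vector
    Q = np.zeros((M, k))             # Lanczos basis, kept for reorthogonalization
    q_prev = np.zeros(M)
    alphas, betas = [], []
    beta = 0.0
    for j in range(k):
        Q[:, j] = q
        w = R @ q - beta * q_prev    # one matrix-vector product: O(M^2)
        alpha = q @ w
        w -= alpha * q
        basis = Q[:, : j + 1]
        w -= basis @ (basis.T @ w)   # full reorthogonalization
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < 1e-12:             # Krylov subspace is exhausted
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    n = len(alphas)
    off_diagonal = betas[: n - 1]
    T = np.diag(alphas) + np.diag(off_diagonal, 1) + np.diag(off_diagonal, -1)
    ritz = np.linalg.eigvalsh(T)     # eigenvalues of the small n x n matrix
    return ritz[0], ritz[-1]


# Toy snapshot-based correlation matrix: M = 64 sensors, 200 snapshots.
demo_rng = np.random.default_rng(42)
snapshots = demo_rng.standard_normal((64, 200))
R_demo = snapshots @ snapshots.T / 200
lam_min_est, lam_max_est = lanczos_extreme_eigenvalues(R_demo, k=20)
```

The k matrix-vector products dominate the cost, giving the O(kM^2) scaling quoted in the abstract, and the Ritz values always lie inside the true spectrum, so the estimates approach the extreme eigenvalues from within as k grows.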

2605.11192 2026-05-13 cs.SD cs.AI cs.LG updated version

Exploring Token-Space Manipulation in Latent Audio Tokenizers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

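Affiliations: Mila – Québec AI Institute; Université Laval; Concordia University

AI summary: This paper studies token-space manipulation in latent audio tokenizers and proposes LATTE, a new audio tokenizer that introduces learnable latent tokens to make global speech attributes editable. While maintaining competitive speech reconstruction quality, the method makes it possible to modify global attributes such as speaker identity or background noise by swapping tokens, and it is validated on voice conversion and denoising tasks, suggesting a route to controllable audio editing without supervision.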

Abstract

Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.

2605.11098 2026-05-13 cs.SD updated version

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Jiacheng Shi, Hongfei Du, Xinyuan Song, Y. Alicia Hong, Yanfu Zhang, Ye Gao

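Affiliations: College of William & Mary; Emory University; George Mason University

AI summary: AffectCodec is an emotion-aware neural speech codec for expressive speech modelling, designed to preserve the emotional information in speech during quantization. By combining emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment, it retains emotionally salient cues under compression while maintaining semantic fidelity and prosodic naturalness. Experiments show improved emotion consistency and perceptual quality in speech reconstruction, emotion recognition, and downstream text-to-speech generation.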

Comments Accepted to ACL Findings 2026

Abstract

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

2602.16416 2026-05-13 eess.AS cs.SD updated version

Online Single-Channel Audio-Based Sound Speed Estimation for Robust Multi-Channel Audio Control

Andreas Jonas Fuglsig, Mads Græsbøll Christensen, Jesper Rindom Jensen

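AI summary: This study addresses the systematic errors that changes in the speed of sound cause in multi-channel audio control. It proposes an online sound speed estimation method based on single-channel audio that needs neither separate calibration nor multiple microphones and estimates the speed of sound during audio playback. The method estimates sound speed by minimizing the mismatch between the measured audio and a parametric acoustic model; simulations show that it tracks sound speed accurately for different input signals and improves spatial audio control performance.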

Comments Accepted for publication at EUSIPCO 2026

Abstract

Robust spatial audio control relies on accurate acoustic propagation models, yet environmental variations, especially changes in the speed of sound, cause systematic mismatches that degrade performance. Existing methods either assume known sound speed, require multiple microphones, or rely on separate calibration, making them impractical for systems with minimal sensing. We propose an online sound speed estimator that operates during general multichannel audio playback and requires only a single observation microphone. The method exploits the structured effect of sound speed on the reproduced signal and estimates it by minimizing the mismatch between the measured audio and a parametric acoustic model. Simulations show accurate tracking of sound speed for diverse input signals and improved spatial control performance when the estimates are used to compensate propagation errors in a sound zone control framework.

2402.07619 2026-05-13 cs.SD cs.AI eess.AS updated version

Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data

Yuyang Yan, Wafaa Aljbawi, Sami O. Simons, Visara Urovi

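Affiliations: Institute of Data Science, Maastricht University; Department of Respiratory Medicine, Maastricht University Medical Center, Maastricht University

AI summary: This study develops a multivariate deep learning model for detecting COVID-19 from crowd-sourced respiratory voice data. Using speech samples from the Cambridge COVID-19 Sound database, it extracts several voice features, including Mel-spectrograms, MFCCs, and CNN encoder features, and builds deep learning classifiers including LSTM, CNN, and HuBERT. HuBERT outperforms the baseline machine learning methods, reaching 86% accuracy and an AUC of 0.93, which suggests that voice data holds promise for COVID-19 diagnosis.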

Comments arXiv admin note: text overlap with arXiv:2209.03727

Abstract

COVID-19 has affected more than 223 countries worldwide, and in the post-COVID era there is a pressing need for non-invasive, low-cost, and highly scalable solutions to detect COVID-19. We develop a deep learning model to identify COVID-19 from voice recording data. The novelty of this work is in the development of deep learning models for COVID-19 identification from only voice recordings. We use the Cambridge COVID-19 Sound database, which contains 893 speech samples crowd-sourced from 4352 participants via a COVID-19 Sounds app. Voice features including Mel-spectrograms, Mel-frequency cepstral coefficients (MFCC), and CNN Encoder features are extracted. Based on the voice data, we develop deep learning classification models to detect COVID-19 cases. These models include Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Hidden-Unit BERT (HuBERT). We compare their predictive power to baseline machine learning models. HuBERT achieves the highest accuracy of 86% and the highest AUC of 0.93. Compared with state-of-the-art results, the proposed models show promise for COVID-19 diagnosis from voice recordings.