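arXivDaily daily academic digest: synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.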

1. Speech Recognition and Keyword Spotting (3 papers)

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP cross-listed

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

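AI summary: A systematic study of combinations of spectral features and acoustic models. By adding pitch features and tuning the number of overlapping frames between training chunks, the F-TDNN model achieves relative improvements of 4.65% in isolated-word recognition and 4.63% in sentence recognition.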

Abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different acoustic models, offering suitable feature selections for each. The incorporation of pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.
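
The abstract credits part of the gain to the number of frames shared by consecutive training chunks. As a rough illustration only, the sketch below shows how a fixed-size chunking with overlap might be laid out over an utterance. The function name `chunk_spans` and the sizes used are hypothetical; the abstract does not state the chunk width or overlap the authors chose.

```python
def chunk_spans(num_frames, chunk_size, overlap):
    """Return (start, end) frame spans of fixed-size chunks.

    Consecutive chunks share `overlap` frames; a trailing remainder
    shorter than `chunk_size` is dropped in this sketch.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start + chunk_size <= num_frames:
        spans.append((start, start + chunk_size))
        start += step
    return spans


# Hypothetical sizes, chosen only to make the effect visible.
no_overlap = chunk_spans(num_frames=10, chunk_size=4, overlap=0)
half_overlap = chunk_spans(num_frames=10, chunk_size=4, overlap=2)
print(no_overlap)    # [(0, 4), (4, 8)]
print(half_overlap)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

With more overlap, the same utterance yields more training chunks and every frame appears in more contexts, which is one plausible reason the overlap setting matters when data is scarce.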

2606.19910 2026-06-19 cs.CL cs.SD eess.AS cross-listed

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

Affiliation: Qatar Computing Research Institute, Doha, Qatar

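AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources. It computes surprisal with a language model over discretized speech tokens, combines it with text-guided alignment features, and approaches supervised performance while operating unsupervised or with light calibration.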

Comments Accepted to Interspeech 2026

Abstract

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit–DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
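
To make the surprisal idea concrete, here is a minimal sketch that scores discrete token sequences with an add-alpha smoothed bigram model. The bigram model, the toy token IDs, and the names `train_bigram` and `mean_surprisal` are assumptions for illustration; the abstract does not specify the token language model's architecture, and real tokens would come from an SSL encoder plus K-means rather than hand-written lists.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab_size, alpha=1.0):
    """Fit an add-alpha smoothed bigram model on native token sequences."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def prob(prev, cur):
        # Smoothed estimate of P(cur | prev); unseen pairs get small mass.
        return (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)

    return prob


def mean_surprisal(seq, prob):
    """Average surprisal in bits per transition: -log2 P(cur | prev)."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    return sum(-math.log2(prob(p, c)) for p, c in pairs) / len(pairs)


# Toy "native" token sequences (hypothetical codebook of size 4).
native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 2, 3]]
prob = train_bigram(native, vocab_size=4)

typical = mean_surprisal([0, 1, 2, 3], prob)  # follows native patterns
deviant = mean_surprisal([3, 0, 2, 1], prob)  # transitions never seen in training
print(typical < deviant)  # True
```

A sequence built from transitions the native model has never seen receives higher average surprisal, which is the signal the framework treats as evidence of deviation.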

2606.20106 2026-06-19 eess.AS cs.SD cross-listed

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

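AI summary: Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder and uses multiplicative late fusion for zero-shot keyword spotting with speaker verification, reducing the target false rejection rate by up to 60% on multiple datasets.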

Comments Accepted to Interspeech 2026
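
The summary for this entry mentions multiplicative late fusion of a keyword score and a speaker score. A minimal sketch of that general idea follows, assuming both scores are already normalized to the range 0 to 1. The function names and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
def fused_score(keyword_score, speaker_score):
    """Multiply the two scores: the product is high only when both are high."""
    return keyword_score * speaker_score


def accept(keyword_score, speaker_score, threshold=0.5):
    """Trigger only if the fused score reaches the (hypothetical) threshold."""
    return fused_score(keyword_score, speaker_score) >= threshold


print(accept(0.9, 0.9))  # True: right keyword, enrolled speaker
print(accept(0.9, 0.2))  # False: right keyword, different speaker
print(accept(0.2, 0.9))  # False: enrolled speaker, different phrase
```

Compared with averaging the two scores, a product lets either weak score veto the detection, which suits a wake word that should respond to one user only.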

2. Low-Resource, Multilingual, and Dialectal Speech (2 papers)

2606.19791 2026-06-19 eess.AS cs.AI cs.SD cross-listed

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

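AI summary: For low-resource children's speech recognition, systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that specific strategies significantly improve generalization.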

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP cross-listed

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

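AI summary: To address data scarcity and varying severity in dysarthric speech recognition, this paper fine-tunes a pre-trained Wav2Vec2 model with four data augmentation methods (SRM, PM, FM, VTLP) and achieves significant word error rate reductions across severity levels.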

Abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the End-to-End pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM (s = 0.8) for low (9.02%) and medium (38.11%) severities, and with PM (τ = 0.8) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
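
For readers unfamiliar with the two metrics quoted above, the sketch below shows how word error rate and relative improvement are conventionally computed. The sentences and the 20.0 and 15.0 figures are made-up toy values, not results from the paper, and the helper names are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Percentage reduction of the new WER relative to the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy example: one substitution and one deletion over six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
print(relative_improvement(20.0, 15.0))  # 25.0
```

A relative improvement is measured against the baseline error, so the same absolute drop in WER counts for more at low severity, where the baseline is already small.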

3. Datasets, Benchmarks, and Evaluation (2 papers)

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD cross-listed

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

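AI summary: Using perturbations of acoustic quality, prosody, and speaker characteristics, the study finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they are biased by mean F0 yet insensitive to changes in speaking rate and F0 variability.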

Comments Accepted to INTERSPEECH 2026

Abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD cross-listed

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

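AI summary: Proposes PASQA, trained on an accent-controllable synthetic dataset with pseudo accent-quality scores, using self-supervised representations, mora-conditioned fusion, and related training strategies; it evaluates pitch-accent correctness more effectively than conventional MOS models.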

Comments Accepted to INTERSPEECH 2026

Abstract

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
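
The abstract evaluates whether a model preserves the ordering by accent-error severity. One plausible reading of such a metric is pairwise ordering accuracy, sketched below. The exact definition used in the paper is not given in the abstract, so the function `ordering_accuracy` and all numbers here are illustrative assumptions.

```python
from itertools import combinations


def ordering_accuracy(error_rates, predicted_scores):
    """Fraction of utterance pairs ranked consistently with error severity.

    A pair counts as correct when the utterance with the higher
    accent-error rate receives the strictly lower predicted score.
    """
    correct = total = 0
    for i, j in combinations(range(len(error_rates)), 2):
        if error_rates[i] == error_rates[j]:
            continue  # tied severities carry no ordering information
        total += 1
        worse, better = (i, j) if error_rates[i] > error_rates[j] else (j, i)
        if predicted_scores[worse] < predicted_scores[better]:
            correct += 1
    return correct / total if total else float("nan")


# Toy accent-error rates for four variants of one utterance (hypothetical).
error_rates = [0.0, 0.25, 0.5, 1.0]
accent_aware = [4.5, 3.9, 3.1, 1.8]  # scores fall as errors increase
accent_blind = [4.1, 4.2, 4.0, 4.1]  # scores barely react to errors

print(ordering_accuracy(error_rates, accent_aware))  # 1.0
print(ordering_accuracy(error_rates, accent_blind))  # 0.5
```

A model that ignores accent errors lands near chance on this measure even when its absolute scores look reasonable, which is the failure mode the abstract attributes to conventional MOS predictors.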
