arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25967 2026-05-26 cs.LG cs.SD version update

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

Georgios Milis, Yubin Qin, Yihan Wu, Heng Huang

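Affiliations: Department of Computer Science, University of Maryland, College Park, USA

AI summary: Exploiting vocabulary redundancy in discretization, this paper proposes a watermarking method for synthetic audio that needs no fine-tuning or gradients. Reducing the vocabulary via community detection makes detection more robust, and performance stays high under audio modifications.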

Comments Accepted to ICML 2026

Abstract

As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.
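The abstract does not state which detection statistic is used, so the following is only a minimal sketch under assumptions: it uses a generic green-list token watermark with a one-sided z-test (a common inference-time scheme, not necessarily the authors'), and a hypothetical `cluster_of` mapping that stands in for the reduced vocabulary obtained via community detection. It illustrates why scoring cluster ids instead of raw token ids can tolerate re-tokenization errors that stay inside a cluster. The key string, `gamma`, and all function names are illustrative.

```python
import hashlib
import math


def is_green(prev_id, cur_id, key="demo-key", gamma=0.5):
    """Pseudo-randomly place cur_id on the green list, seeded by prev_id and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_id}:{cur_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens, cluster_of=None, gamma=0.5):
    """One-sided z-statistic for the number of green tokens.

    cluster_of (hypothetical) maps raw token ids to reduced-vocabulary ids;
    when given, the test runs on cluster ids instead of raw ids.
    """
    ids = [cluster_of[t] for t in tokens] if cluster_of else list(tokens)
    n = len(ids) - 1  # number of (previous, current) pairs scored
    hits = sum(is_green(p, c, gamma=gamma) for p, c in zip(ids, ids[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

In this toy setup, two token sequences that differ only by swaps within the same cluster receive exactly the same score, whereas their raw-id scores would generally differ.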

2605.25962 2026-05-26 cs.SD cs.AI version update

Continual Speaker Identity Unlearning with Minimal Interference

Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko

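Affiliations: Sungkyunkwan University; Korea University

AI summary: Proposes the CORTIS framework, which uses Fisher-information parameter masking and orthogonal projection to achieve continual speaker identity unlearning in zero-shot speech synthesis and to keep previously unlearned speakers from re-emerging.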

Comments preprint

Abstract

Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, quietly assume that all unlearning requests arrive at once, which is unrealistic because privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/ .
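The two ingredients named in the abstract, Fisher-based masking and orthogonal projection against earlier updates, can be illustrated on flat parameter vectors. This is a minimal sketch under assumptions rather than the CORTIS implementation: the order of operations (mask first, then project), the top-k mask, the `keep_ratio` value, and every function name are hypothetical.

```python
import math


def top_fisher_mask(fisher, keep_ratio):
    """0/1 mask keeping the keep_ratio fraction of weights with the largest Fisher values."""
    k = max(1, int(round(keep_ratio * len(fisher))))
    order = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(fisher))]


def project_out(vec, basis):
    """Remove from vec its components along each orthonormal basis vector."""
    out = list(vec)
    for b in basis:
        coeff = sum(o * bi for o, bi in zip(out, b))
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


def constrained_update(grad, fisher, basis, keep_ratio=0.5):
    """Mask the gradient, project it away from prior updates, then record it in the basis."""
    mask = top_fisher_mask(fisher, keep_ratio)
    masked = [g * m for g, m in zip(grad, mask)]
    update = project_out(masked, basis)
    norm = math.sqrt(sum(u * u for u in update))
    if norm > 1e-12:  # store a unit vector so the basis stays orthonormal
        basis.append([u / norm for u in update])
    return update
```

Calling `constrained_update` once per unlearning request with a shared `basis` list yields updates that are mutually orthogonal, so a later request cannot undo an earlier one along its recorded direction. Note that in this simplified version the projection can move a small amount of mass onto masked-out coordinates.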

2605.25951 2026-05-26 cs.SD version update

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

Patricia Hu, Silvan Peter, Gerhard Widmer

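Affiliations: Institute of Computational Perception, Johannes Kepler University; LIT AI Lab, Linz Institute of Technology

AI summary: To address structural inconsistencies in large-scale datasets of automatically transcribed piano performances, proposes a score-agnostic grouping method based on sequence alignment and hierarchical clustering, with musical coherence replacing ground-truth accuracy as the evaluation criterion.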

Comments published at the Music Encoding Conference (MEC) 2026

Abstract

In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.
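The pipeline described above (pairwise alignment, then hierarchical clustering on alignment cost and length dissimilarity) can be sketched as follows. This is an illustration under assumptions, not the authors' code: transcriptions are reduced to plain pitch sequences, alignment is plain dynamic time warping, linkage is average linkage, and `length_weight` and `threshold` are hypothetical parameters.

```python
def dtw_cost(a, b):
    """Length-normalized dynamic-time-warping cost between two pitch sequences."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1] / (len(a) + len(b))


def distance(a, b, length_weight=1.0):
    """Alignment cost plus a penalty for differing sequence lengths."""
    length_gap = abs(len(a) - len(b)) / max(len(a), len(b))
    return dtw_cost(a, b) + length_weight * length_gap


def cluster(seqs, threshold):
    """Average-linkage agglomerative clustering; merging stops above threshold."""
    d = [[distance(a, b) for b in seqs] for a in seqs]
    groups = [[i] for i in range(len(seqs))]

    def link(g, h):
        return sum(d[i][j] for i in g for j in h) / (len(g) * len(h))

    while len(groups) > 1:
        pairs = [(link(g, h), gi, hi)
                 for gi, g in enumerate(groups)
                 for hi, h in enumerate(groups) if gi < hi]
        best, gi, hi = min(pairs)
        if best > threshold:
            break
        groups[gi] += groups.pop(hi)
    return groups
```

For example, with a four-note theme and a version that repeats it, `cluster([theme, theme + theme, theme, theme + theme], 0.25)` separates the performances that take the repeat from those that do not.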

2605.25928 2026-05-26 cs.CL cs.SD eess.AS version update

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi

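Affiliations: Thaka

AI summary: For the low-resource Arabic speech diacritization task, the CATT-Whisper multimodal model is fine-tuned with regularization, combining R-Drop consistency regularization, Optuna-optimized hyperparameters, and Focal Loss, and takes first place in the KSAA-2026 shared task.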

Comments 4 pages, 1 figure. Published in Proceedings of OSACT7 (LREC 2026). Winning system for KSAA-2026 Task 2 on Arabic Speech Diacritization

Abstract

We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
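The inference-time step, averaging stochastic forward passes at the softmax probability level across checkpoints, can be sketched as below. This is a minimal illustration under assumptions: `forward` is a hypothetical callable standing in for a dropout-enabled model, `toy_forward` merely adds Gaussian noise to fixed logits, and the abstract does not say whether the 200 passes are per checkpoint or in total, so the split is left as a parameter.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_average(forward, checkpoints, passes_per_checkpoint, seed=0):
    """Average class probabilities (not logits) over stochastic passes of every checkpoint."""
    rng = random.Random(seed)
    total, count = None, 0
    for ckpt in checkpoints:
        for _ in range(passes_per_checkpoint):
            probs = softmax(forward(ckpt, rng))
            total = probs if total is None else [t + p for t, p in zip(total, probs)]
            count += 1
    return [t / count for t in total]


def toy_forward(ckpt, rng):
    """Stand-in for a model with dropout active: fixed logits plus random noise."""
    return [v + rng.gauss(0.0, 0.5) for v in ckpt]
```

Averaging probabilities rather than logits keeps the result a valid distribution and prevents a single over-confident pass from dominating the ensemble.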

2605.25540 2026-05-26 cs.SD cs.LG version update

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Loukas Ilias, Dimitris Askounis

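Affiliations: Decision Support Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens

AI summary: Proposes an end-to-end trainable multimodal deep learning framework that extracts acoustic and textual features with pre-trained models and combines attention-based fusion with mutual information maximization for automatic dementia detection.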

Abstract

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
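Attentive statistics pooling, used above to aggregate frame-level acoustic embeddings, reduces a variable-length sequence to an attention-weighted mean and standard deviation. The sketch below shows only that arithmetic, under the assumption that the per-frame attention scores are already available; in a real model they would come from a small learned layer, which is omitted here, and the function name is illustrative.

```python
import math


def attentive_stats_pool(frames, scores):
    """Concatenate the attention-weighted mean and standard deviation of frame embeddings.

    frames: list of equal-length embedding vectors, one per frame.
    scores: one unnormalized attention score per frame (assumed given).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]  # softmax over frames
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    var = [sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
           for d in range(dim)]
    return mean + [math.sqrt(max(v, 0.0)) for v in var]
```

The output has twice the embedding dimension, so a 10-second segment of any frame count maps to a fixed-size vector.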

2605.18916 2026-05-26 cs.MM cs.AI cs.CV cs.SD eess.AS version update

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

Gyubin Lee, Junwon Lee, Juhan Nam

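Affiliations: Kim Jaechul Graduate School of AI, KAIST

AI summary: Proposes CounterFlow, a two-phase inference-time sampling scheme for pretrained flow-matching VT2A models that generates counterfactual video Foley which contradicts the visual evidence yet stays temporally synchronized, and evaluates replacement quality with a new metric.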

Comments accepted to CVPR 2026 Workshop on Sight and Sound

Abstract

We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
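The control flow of a two-phase sampler for a flow-matching model can be sketched with a plain Euler integrator that switches velocity fields partway through. This is a sketch under assumptions, not the released CounterFlow code: the switch point, the step count, and the use of a negative-prompt-style guidance term for Phase 1 suppression are all hypothetical, and the velocity callables stand in for calls to a pretrained VT2A model.

```python
def two_phase_sample(x0, v_phase1, v_phase2, switch=0.5, steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1, changing velocity field at `switch`."""
    x, dt = list(x0), 1.0 / steps
    switch_step = int(round(switch * steps))
    for k in range(steps):
        field = v_phase1 if k < switch_step else v_phase2
        v = field(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def guided(v_cond, v_neg, scale):
    """Guidance-style combination: follow v_cond while pushing away from v_neg."""
    def field(x, t):
        return [c + scale * (c - n) for c, n in zip(v_cond(x, t), v_neg(x, t))]
    return field
```

Under these assumptions, `v_phase1` would be something like `guided(video_conditioned, implied_source, scale)` and `v_phase2` a text-only velocity for the target prompt, where all three callables are placeholders.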

2605.04700 2026-05-26 cs.CR cs.AI cs.CL cs.LG cs.SD version update

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge

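Affiliations: Wuhan University; Institute for Math & AI, Wuhan University; Huazhong University of Science; Shanghai Jiao Tong University; Xidian University

AI summary: Proposes Token-Aware Gradient Optimization (TAGO), which performs sparse jailbreak attacks by keeping only the waveform gradients aligned with high-gradient-energy audio tokens, sharply reducing the amount of optimization while maintaining a high success rate.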

Comments To appear in the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
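The masking step described above can be sketched on a flat waveform gradient. This is an illustration under assumptions, not the TAGO implementation: each audio token is assumed to cover a fixed number of consecutive samples, gradient energy is taken as the sum of squared gradient values within a token's span, and the function and parameter names are hypothetical.

```python
def token_aware_mask(grad, samples_per_token, retention=0.25):
    """Zero waveform gradients outside the highest-energy token-aligned segments."""
    n_tokens = len(grad) // samples_per_token
    spans = [(i * samples_per_token, (i + 1) * samples_per_token)
             for i in range(n_tokens)]
    energy = [sum(g * g for g in grad[a:b]) for a, b in spans]
    k = max(1, int(round(retention * n_tokens)))
    keep = sorted(range(n_tokens), key=lambda i: energy[i], reverse=True)[:k]
    masked = [0.0] * len(grad)  # samples past the last full token stay zero
    for i in keep:
        a, b = spans[i]
        masked[a:b] = grad[a:b]
    return masked
```

With `retention=0.25`, only a quarter of the token-aligned regions receive an update in a given iteration; the mask would be recomputed from the fresh gradient at every step.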

2604.12383 2026-05-26 cs.SD version update

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

Changhao Cheng, Wei Wang, Wangyou Zhang, Dongya Jia, Jian Wu, Zhuo Chen, Yanmin Qian

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Auditory Cognition and Computational Acoustics Lab; MoE Key Lab of Artificial Intelligence, AI Institute; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; ByteDance Seed, China

AI summary: Systematically explores how different alignment methods in speech VAEs affect performance on reconstruction, understanding, and generation tasks, and proposes joint-marginal alignment with an adaptive weighting strategy to achieve the best overall performance.

Comments Submitted to Interspeech 2026

Abstract

Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete-token features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning them with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely used alignment approach based on time-axis distillation is optimal when more tasks are considered. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on performance along three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.

2605.24825 2026-05-26 eess.SP cs.SD cs.SY eess.AS eess.SY math.OC version update

Time Segmented Beamforming via Dynamic Programming: Theory and Implementation

Manan Mittal, Ryan M. Corey, Diego Cuji, John R. Buck, Andrew C. Singer

Affiliations: Department of Electrical and Computer Engineering, Stony Brook University; Department of Electrical and Computer Engineering, University of Illinois; Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth; College of Applied Science and Engineering, Stony Brook University

AI summary: For environments with time-varying interference, proposes a time-segmented distortionless response beamformer based on dynamic programming, which estimates covariance matrices over data-driven adaptive segments to track non-stationary interferers.

Comments 16 pages, 17 figures, Beamforming New Approach Regret Bounds

Abstract

In dynamic acoustic environments with time-varying interferers, effective beamforming requires identifying stationary regions over time. The Capon beamformer, a whitened matched filter constrained to maintain unity gain in the desired direction, theoretically relies on the instantaneous ensemble covariance matrix. Practical implementations rely on the batch Capon (or Sample Matrix Inversion), which estimates the sample covariance matrix (SCM) by averaging over a block of snapshots. This practical approach implicitly assumes that the data within the batch window is stationary and can be coherently combined. In non-stationary settings, a batch approach that averages over fixed or excessively long windows fails, as moving interferers smear the SCM and degrade the beamformer's nulling capabilities. To address this, this paper introduces a temporally segmented distortionless response beamformer. Inspired by the segmented least squares method, which fits piecewise polynomials to data while penalizing excessive segmentation to prevent overfitting, the framework extends practical Capon beamforming by incorporating data-driven temporal segmentation. This formulation minimizes output power while dynamically adapting the SCM estimation windows to local stationarity, offering a principled approach to tracking time-varying interferers.
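The segmented-least-squares idea that the abstract cites as inspiration is a dynamic program over segment boundaries with a per-segment penalty. The sketch below shows that recursion only, under stated assumptions: `cost(i, j)` is a user-supplied score for treating snapshots `i..j-1` as one stationary block, and the toy `sse_cost` (squared deviation from the block mean) is a stand-in for the beamformer output power obtained with a per-segment sample covariance matrix, which is not implemented here.

```python
def segment(n, cost, penalty):
    """Return segment boundaries minimizing sum(cost(i, j)) + penalty * (number of segments).

    cost(i, j) scores snapshots i..j-1 treated as one stationary block.
    """
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(i, j) + penalty
            if c < best[j]:
                best[j], back[j] = c, i
    bounds, j = [], n
    while j > 0:  # walk the back-pointers from the end of the data
        bounds.append((back[j], j))
        j = back[j]
    return bounds[::-1]


def sse_cost(x):
    """Toy cost: squared deviation from the block mean (stand-in for output power)."""
    def cost(i, j):
        block = x[i:j]
        mean = sum(block) / len(block)
        return sum((v - mean) ** 2 for v in block)
    return cost
```

The penalty plays the role described in the abstract: without it, the minimum is reached by putting every snapshot in its own segment, which overfits. This version evaluates O(n^2) candidate segments.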

2605.24806 2026-05-26 cs.SD cs.AI eess.AS version update

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Muhammad Ashad Kabir, Sirajam Munira

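Affiliations: School of Computing, Mathematics and Engineering, Charles Sturt University; Department of Computer Science, Rensselaer Polytechnic Institute

AI summary: Compares two input modalities, handcrafted acoustic features and raw audio waveforms, to study how zero-shot Parkinson's disease detection performance differs across languages. Handcrafted features prove more stable in low-resource languages, while audio input yields dataset-dependent gains.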

Comments 6 pages

Abstract

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD version update

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

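Affiliations: Tsinghua University; The Chinese University of Hong Kong

AI summary: Proposes AVBench, which enables automated, accurate evaluation of audio-video generation through fine-grained human-centric metrics and specialized evaluators trained via preference learning.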

Abstract

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
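Turning a binary decision into a continuous score can be as simple as a softmax over the two answer logits. The sketch below assumes the evaluator exposes a logit for each of the two decisions (for instance the "yes" and "no" answer tokens); that interface and the function name are assumptions, since the abstract does not describe how the confidence is read out.

```python
import math


def confidence_score(logit_yes, logit_no):
    """Continuous score in (0, 1): softmax probability of the positive decision."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Unlike a parsed text verdict, this score separates a confident "yes" from a marginal one, and it is a smooth function of the logits, which is what makes a reward of this kind differentiable.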

2604.19151 2026-05-26 cs.CL cs.SD eess.AS version update

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

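Affiliations: Indian Institute of Technology, Madras, India; Josh Talks, India

AI summary: To address the limitations of existing Indic ASR benchmarks, proposes Voice of India, a closed-source benchmark built from unscripted telephonic conversations that covers 15 major Indian languages and 139 regional clusters with 306,230 utterances (536 hours), and analyzes how geography, audio quality, speaking rate, gender, and device type affect ASR performance.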

Comments 6 pages, 4 figures

Abstract

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
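The abstract says the transcripts account for spelling variations but not how the scoring uses them. One plausible reading, sketched here purely as an assumption, is a word error rate in which each reference position holds a set of acceptable spellings, so that any listed variant counts as a match. The function name and the toy words in the usage note are made up for illustration.

```python
def variant_wer(reference, hypothesis):
    """Word error rate where each reference slot is a set of acceptable spellings.

    reference: list of sets of strings; hypothesis: list of strings.
    """
    prev = list(range(len(hypothesis) + 1))  # edit distances against an empty reference
    for i, variants in enumerate(reference, start=1):
        cur = [i]
        for j, word in enumerate(hypothesis, start=1):
            substitute = prev[j - 1] + (0 if word in variants else 1)
            cur.append(min(substitute, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1] / len(reference)
```

With the toy reference `[{"doctor", "daktar"}, {"ne"}, {"bola"}]`, the hypothesis `["daktar", "ne", "bola"]` scores 0.0, whereas strict single-reference scoring against `doctor` would count one substitution.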

2602.00443 2026-05-26 cs.SD cs.MM eess.AS version update

RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

Ruinan Jin, Xinting Liao, Hanlin Yu, Deval Pandya, Xiaoxiao Li

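Affiliations: The University of British Columbia; Vector Institute

AI summary: Proposes the RVCBench dataset and benchmark, which uses 18 robustness evaluations, 225 speakers, and 14,370 utterances to systematically evaluate the robustness of voice cloning models under realistic conditions such as noise, multilingual and long-form text, post-processing, and adversarial perturbations.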

Comments 65 pages, 10 figures

Abstract

Modern voice cloning, also known as zero-shot text-to-speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long-form generation, post-processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec-token language models and diffusion-based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task-aligned tests covering controlled text-audio pairing, multilingual and long-form scenarios, expressive prompts, post-processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open-source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long-form stability, post-processing resilience, adversarial robustness, and detector-facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy-Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.

2601.21463 2026-05-26 cs.SD cs.AI version update

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

Affiliations: Key Laboratory of Aerospace Information Security; School of Cyber Science and Engineering; Wuhan University; Independent Researcher; School of Computer Science and Technology; Anhui University; Communication University of China; Beihang University

AI summary: Proposes a unified framework based on audio large language models that jointly handles speech editing detection and content localization through a generative formulation, and introduces a prior-enhanced strategy and an acoustic consistency loss to improve performance.

Abstract

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.

2601.09931 2026-05-26 cs.SD version update

Diffusion-based Frameworks for Unsupervised Speech Enhancement

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, Xavier Alameda-Pineda

Affiliations: Multispeech team, Université de Lorraine, CNRS, Inria, Loria; RobotLearn team, Université Grenoble Alpes, Inria

AI summary: Proposes an unsupervised diffusion framework that jointly models speech and noise as latent variables sampled together in the E-step, and introduces a diffusion-based noise model, significantly improving speech enhancement performance.

Abstract

This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new semi-supervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Code, demo, and supplementary materials are publicly available.
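As a reading aid, the NMF-structured Gaussian noise model mentioned in the abstract is commonly written in the short-time Fourier domain as below. This is a sketch of the usual formulation, not a formula quoted from the paper: the symbols $x$, $s$, $n$, $\mathbf{W}$, $\mathbf{H}$ and the dimensions $F$, $T$, $K$ are notation assumed here for illustration.

```latex
% Noisy mixture at frequency bin f and time frame t (assumed notation)
x_{ft} = s_{ft} + n_{ft}, \qquad
% Zero-mean complex Gaussian noise whose variance is given by an NMF product
n_{ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(0,\; (\mathbf{W}\mathbf{H})_{ft}\right), \qquad
\mathbf{W} \in \mathbb{R}_{+}^{F \times K},\quad \mathbf{H} \in \mathbb{R}_{+}^{K \times T}
```

Under this reading, the EM scheme alternates between sampling the latent signals in the E-step and updating the nonnegative factors in the M-step.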

2510.02171 2026-05-26 cs.SD cs.AI eess.AS Updated version

Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou

Affiliations: Artificial Intelligence and Learning Systems Laboratory; National Technical University of Athens

AI summary: Introduces the witheFlow system, which automatically modulates audio effects in real time from biosignals and audio features to enhance human-machine collaboration in music performance.

Comments Accepted at NeurIPS Creative AI Track 2025: Humanity

Abstract

Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.

2605.24291 2026-05-26 cs.SD cs.CL cs.MM Updated version

Rubato: Transcribing Piano Music with Timestamps

Nazif Can Tamer, Victoria Ebert, Guang Yang, Noah A. Smith

Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for AI

AI summary: Proposes Rubato, a prompt-conditioned encoder-decoder model that, combined with InterMo, a new textual representation for polyphonic music, generates timestamped piano sheet music from audio with higher notational accuracy than existing cascade-based approaches.

Comments 18 pages, 7 figures, 5 tables

Abstract

We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription .

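2605.24193 2026-05-26 cs.SD cs.LG Updated version

Music Transcription with (Almost) No Supervision

Saebyeol Shin, Chao Wan, Zhenzhen Liu, Justin Lovelace, Daniel C. Lin, Kilian Q. Weinberger, John Thickstun

Affiliations: Cornell University

AI summary: Adopts a cycle-consistent translation framework in which a small amount of paired data serves as an anchor, exploiting unpaired audio and score data to achieve high-quality music transcription.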

Abstract

Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that unpaired data yields surprisingly large gains, especially under limited supervision; that unpaired audio contributes more than unpaired scores; and that incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.

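2605.23982 2026-05-26 cs.SD Updated version

PiAnnotate: A Web Annotation Tool for Piano Fingering, with a Diagnostic Probe

Joonhyung Bae, Kirak Kim, Hyeyoon Cho, Sein Lee, Yoon-Seok Choi, Hyeon Hur, Gyubin Lee, Akira Maezawa, Jonghwa Park, Jaebum Park, Juhan Nam

Affiliations: GitHub

AI summary: Presents PiAnnotate, a web-based annotation tool for piano fingering that combines a piano-roll view, performance video, and a 3D hand mesh, makes annotation auditable by keeping paired rule-based and human-edited fingering tracks, and trains a small Transformer probe whose results suggest that the edited labels contain learnable structure.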

Abstract

Piano fingering shapes how a passage can be played, yet it is difficult to label after a performance. An annotator must decide which finger produced each note while reconciling the score, timing, video, and hand motion. We present PiAnnotate, a web-based pipeline for adding expert fingering annotations to the FurElise performance dataset. The tool brings together a piano-roll view, performance video, and a 3D MANO hand mesh so that reviewers can inspect each assignment in musical and physical context. Rather than storing only the final answer, PiAnnotate keeps paired rule-based and human-edited fingering tracks. These paired tracks make the annotation history auditable by showing where a geometric rule was sufficient, where experts intervened, and how labels changed across review passes. As a final diagnostic, we train a small Transformer probe on the paired tracks. The probe improves on the rule baseline on held-out pieces while remaining conservative about changing labels that were already correct, suggesting that the edited labels contain learnable structure rather than only isolated fixes.

2605.23977 2026-05-26 cs.CL cs.SD eess.AS Updated version

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

Takehiro Ishikawa, Jon Duke

Affiliations: College of Computing, Georgia Institute of Technology; Georgia Tech Research Institute, Georgia Institute of Technology

AI summary: Audits clinical-interview depression detection benchmarks with four complementary probes, finding flaws in evaluation protocols, unreliable leaderboards, weak cross-domain generalization, and differing sensitivity of the text and audio modalities to symptom density.

Abstract

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
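The "rank-1 in only 32.3% of subject bootstraps" figure above can be illustrated with a minimal sketch of a subject-level bootstrap. This is not the authors' code: the function name `rank1_frequency` and the per-subject 0/1 correctness matrix are assumptions made for illustration, and plain accuracy stands in for the macro-F1 metric used in the paper.

```python
import random


def rank1_frequency(correct, n_boot=1000, seed=0):
    """Estimate how often each configuration tops the leaderboard.

    correct[c][s] is 1 if configuration c is right on subject s, else 0
    (hypothetical input format). Subjects are resampled with replacement;
    ties are broken in favour of the lowest configuration index.
    """
    rng = random.Random(seed)
    n_cfg = len(correct)
    n_subj = len(correct[0])
    wins = [0] * n_cfg
    for _ in range(n_boot):
        # Draw one bootstrap sample of subject indices.
        sample = [rng.randrange(n_subj) for _ in range(n_subj)]
        # Score every configuration on the same resampled subjects.
        scores = [sum(row[s] for s in sample) for row in correct]
        best = max(range(n_cfg), key=lambda c: scores[c])
        wins[best] += 1
    return [w / n_boot for w in wins]


if __name__ == "__main__":
    toy = [[1, 0, 1, 0, 1], [0, 1, 1, 0, 1], [1, 1, 0, 0, 0]]
    print(rank1_frequency(toy, n_boot=200, seed=1))
```

A configuration whose rank-1 frequency is well below 1 is only a nominal winner: small changes in which subjects land in the test set would reorder the leaderboard.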

2605.23975 2026-05-26 cs.CL cs.SD Updated version

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw

Affiliations: Institute for Infocomm Research (I2R), A*STAR, Singapore; Nanyang Technological University, Singapore

AI summary: Targets systematic failures of Audio LLMs in transcribing English-Mandarin code-switching speech by aligning models with Direct Preference Optimization (DPO), training on preference pairs (responses that preserve mixed-language content versus responses that mimic failure patterns) and reducing MER by up to 89.6% (in-distribution) and 20.0% (out-of-distribution).

Abstract

Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields mixed error rate (MER) reductions of up to 89.6% (in-distribution) and 20.0% (out-of-distribution). Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.
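For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective applied to one chosen/rejected transcription pair. It illustrates the general technique, not the authors' training code: the function names, the default `beta=0.1`, and the use of summed sequence log-probabilities as inputs are assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favours the mixed-language transcript: the loss drops.
    print(dpo_loss(-3.0, -9.0, -5.0, -5.0))
```

In the setting described above, the chosen response would be a transcript that keeps both languages, and the rejected response one that omits a language, translates, or hallucinates.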

2605.23954 2026-05-26 cs.CL cs.AI cs.SD Updated version

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

Affiliations: NTU; SHU; ICT, CAS; HDU; BUPT; USTC; SKL-NST, BUPT

AI summary: Proposes EchoDistill, a framework in which a frozen clean-audio teacher guides a noisy-audio student through group-relative policy optimization, achieving noisy-to-clean self-distillation alignment and improving the semantic reliability and task performance of Audio LLMs under complex noise.

Abstract

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
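The reward structure described above (a task reward plus a teacher-consistency bonus, normalized within a group of sampled responses) can be sketched as follows. This is a hypothetical illustration, not the released implementation: the position-wise `token_agreement` measure, the `bonus_weight` value, and both function names are assumptions.

```python
from statistics import mean, pstdev


def token_agreement(student_tokens, teacher_tokens):
    """Fraction of positions where student and teacher tokens match.

    A deliberately simple, assumed consistency measure; unmatched trailing
    positions count as disagreements.
    """
    n = max(len(student_tokens), len(teacher_tokens))
    if n == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(student_tokens, teacher_tokens))
    return matches / n


def group_relative_advantages(task_rewards, agreements,
                              bonus_weight=0.5, eps=1e-8):
    """GRPO-style advantages for one group of sampled responses.

    Each reward is the task reward plus a weighted teacher-consistency
    bonus; advantages are rewards standardized within the group.
    """
    rewards = [r + bonus_weight * a for r, a in zip(task_rewards, agreements)]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    teacher = ["the", "dog", "barks"]
    candidates = [["the", "dog", "barks"], ["the", "cat", "barks"]]
    agreements = [token_agreement(c, teacher) for c in candidates]
    print(group_relative_advantages([1.0, 1.0], agreements))
```

With equal task rewards, the candidate that agrees more with the clean-audio teacher receives the larger advantage, which is the effect a consistency bonus is meant to have.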

2605.23912 2026-05-26 cs.CL cs.AI cs.SD Updated version

Raon-Speech Technical Report

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

Affiliations: KRAFTON

AI summary: Presents Raon-Speech, a 9B-parameter speech language model trained in multiple stages for English and Korean speech understanding, answering, and generation, and extends it to the full-duplex conversational model Raon-SpeechChat, outperforming comparable models on speech tasks.

Abstract

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It is trained on 1.38M hours of highly curated English and Korean speech and text datasets in three stages: (1) speech module alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, and (3) full-duplex fine-tuning for voice and role control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.