RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models
Ruinan Jin, Xinting Liao, Hanlin Yu, Deval Pandya, Xiaoxiao Li
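AI summary: Proposes the RVCBench dataset and benchmark, which systematically evaluates the robustness of voice cloning models under realistic conditions such as noise, multilingual input, long-form text, post-processing, and adversarial perturbations, using 18 robustness evaluations, 225 speakers, and 14,370 utterances.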
Comments: 65 pages, 10 figures
Modern voice cloning, also known as zero-shot text-to-speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long-form generation, post-processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec-token language models and diffusion-based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task-aligned tests covering controlled text-audio pairing, multilingual and long-form scenarios, expressive prompts, post-processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open-source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long-form stability, post-processing resilience, adversarial robustness, and detector-facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy-Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.
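The abstract names content consistency and speaker similarity as two of the evaluated axes but does not say which metrics are used. As a rough illustration only, the sketch below shows two proxies that are common in TTS evaluation: word error rate (WER) between the prompt text and a transcript of the generated audio, and cosine similarity between speaker embeddings. Both choices are assumptions and are not taken from RVCBench; the function names are hypothetical, and the transcripts and embedding vectors are assumed to come from an ASR system and a speaker encoder that are not shown here.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumed proxy for content consistency; not necessarily RVCBench's metric.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumed proxy for speaker similarity; the embeddings themselves would
    come from a speaker encoder that is outside this sketch.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

Under these assumptions, a robustness check would compare such scores for clean and perturbed reference audio: a rise in WER or a drop in embedding similarity under perturbation would indicate reduced robustness.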