Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing
Awais Khan, Kutub Uddin, Khalid Malik
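AI summary: To address open-set audio deepfake source tracing, the paper proposes a dual-branch gated fusion framework that combines XLSR-53 with the CORES descriptor and weights the two branches adaptively through an input-conditioned gate, achieving high in-domain accuracy and robust out-of-domain generalization.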
Details
Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor. Unlike prior work that relies on Linear Filter Bank (LFB) features alone, CORES spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows that XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (out-of-domain, OOD), yet naively concatenating the two fails because of SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch, trained jointly with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
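The abstract does not give the exact form of the gate or the losses, so the following is only a minimal sketch of how an input-conditioned gate and an energy margin objective are commonly written. Every name and formula here is an assumption for illustration, not the authors' implementation: `gated_fusion` uses a single scalar sigmoid gate over the concatenated branch embeddings (both assumed already projected to the same dimension), and `energy_margin_loss` uses the squared-hinge form familiar from energy-based OOD detection, with hypothetical margins `m_in` and `m_out`. The gate diversity term is not sketched because the abstract does not describe it.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(h_ssl, h_cores, gate_w, gate_b):
    """Blend two same-sized branch embeddings with one scalar gate.

    Assumption: the gate is a linear layer plus sigmoid over the
    concatenation of both embeddings, so the weight depends on the input.
    """
    joint = h_ssl + h_cores  # list concatenation, length 2 * d
    g = sigmoid(sum(w * x for w, x in zip(gate_w, joint)) + gate_b)
    fused = [g * a + (1.0 - g) * b for a, b in zip(h_ssl, h_cores)]
    return fused, g


def energy_score(logits):
    """Free energy E(x) = -logsumexp(logits); confident inputs score lower."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(v - m) for v in logits)))


def energy_margin_loss(id_energies, ood_energies, m_in, m_out):
    """Squared hinge: push ID energies below m_in and OOD energies above m_out."""
    loss_in = sum(max(0.0, e - m_in) ** 2 for e in id_energies) / len(id_energies)
    loss_out = sum(max(0.0, m_out - e) ** 2 for e in ood_energies) / len(ood_energies)
    return loss_in + loss_out


# Toy usage: with all-zero gate weights the gate is exactly 0.5,
# so the fused embedding is the plain average of the two branches.
fused, g = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_w=[0.0] * 4, gate_b=0.0)
```

In a trained system the gate weights would be learned, so `g` could lean toward the SSL branch on in-domain inputs and toward the handcrafted branch under shift. At test time, thresholding `energy_score` is one common way to reject utterances from unseen synthesizers.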