LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation
Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu
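AI summary: Proposes LoSATok, a low-dimensional audio tokenizer that uses semantic-bottleneck compression and dual-level semantic supervision to jointly capture semantics and acoustic details in a compact latent space, improving the generation performance of Diffusion Transformers.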
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands both semantics and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden on Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong semantic capacity at low dimensionality and that LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and general audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support both audio understanding and generation. Our code is available at https://github.com/wxzyd123/LoSATok.
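The abstract does not define the time-relation loss, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes the loss compares frame-to-frame cosine-similarity matrices before and after a bottleneck, and it stands in a random linear projection for the learned Semantic Bottleneck. The names `cosine_similarity_matrix`, `time_relation_loss`, and `project` are hypothetical and do not come from the LoSATok code base.

```python
import math
import random


def cosine_similarity_matrix(frames):
    """Return the T x T matrix of cosine similarities between frames."""
    # Guard against zero-norm frames by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in frames]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(frames, norms)]
        for fi, ni in zip(frames, norms)
    ]


def time_relation_loss(high, low):
    """ASSUMED form: mean squared gap between the two similarity matrices."""
    s_high = cosine_similarity_matrix(high)
    s_low = cosine_similarity_matrix(low)
    t = len(high)
    total = sum(
        (s_high[i][j] - s_low[i][j]) ** 2
        for i in range(t) for j in range(t)
    )
    return total / (t * t)


def project(frames, weights):
    """Linear map of each frame to len(weights) dimensions (a stand-in
    for the learned bottleneck, which is an assumption here)."""
    return [
        [sum(w * x for w, x in zip(row, f)) for row in weights]
        for f in frames
    ]


rng = random.Random(0)
T, D_HIGH, D_LOW = 8, 1280, 128  # dimensions taken from the abstract

# Random stand-ins for semantic encoder features: T frames of 1280 dims.
high = [[rng.gauss(0, 1) for _ in range(D_HIGH)] for _ in range(T)]

# Random (untrained) projection from 1280 to 128 dimensions.
weights = [
    [rng.gauss(0, 1) / math.sqrt(D_HIGH) for _ in range(D_HIGH)]
    for _ in range(D_LOW)
]

low = project(high, weights)
loss = time_relation_loss(high, low)
print(f"time-relation loss (sketch): {loss:.6f}")
```

Under this assumed form, the loss is zero when the compressed frames relate to one another exactly as the original frames do, and it ignores per-frame scale because it depends only on cosine similarities. In training, such a term would be minimized with respect to the bottleneck parameters rather than evaluated on a fixed random projection.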