arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.03455 2026-06-03 eess.AS cs.SD (version update)

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Wenxi Chen, Dongya Jia, Yushen Chen, Zhikang Niu, Yuzhe Liang, Xiquan Li, Ruiqi Yan, Ziyang Ma, Guanrou Yang, Sanyuan Chen, Yue Wang, Zhuo Chen, Kai Yu, Xie Chen

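Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; ByteDance Seed

AI summary: Proposes WavTTS, described as the first raw-waveform generative TTS model built on flow matching and a Diffusion Transformer. It models waveforms directly through a simple patchification strategy, adds multi-scale mel-spectrogram supervision, and approaches the performance of latent-space generative models in zero-shot TTS.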

Abstract

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon flow matching with a Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
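
As a rough illustration of the two ideas the abstract names, the sketch below splits a waveform into fixed-size patches (which shortens the sequence a Transformer has to process) and forms a point on a linear flow-matching path between noise and data. The patch size, the zero-padding, and the linear path convention are assumptions chosen for illustration; the abstract does not specify WavTTS's actual patch size, schedule, or prediction target.

```python
def patchify(wave, patch_size):
    """Split a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    pad = (-len(wave)) % patch_size
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


def interpolate(noise, data, t):
    """Point on a linear path x_t = (1 - t) * noise + t * data (assumed convention)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


# Hypothetical toy input: 10 samples become 3 patches of 4 samples each.
wave = [0.1 * i for i in range(10)]
patches = patchify(wave, 4)
restored = unpatchify(patches, len(wave))
```

With a patch size of P, a clip of T samples becomes roughly T / P tokens, which is the sequence-length reduction that makes direct waveform modeling tractable.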

2606.03116 2026-06-03 eess.AS cs.AI cs.SD (version update)

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen

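Affiliations: Zhejiang University; Shanghai Innovation Institute; Shanghai Jiao Tong University; Tencent Hunyuan

AI summary: Addresses the difficulty of decoupling complex instructions, the lack of interpretability, and the poor fine-grained attribute matching in evaluating instruction-guided audio generation. Proposes a dynamic rubric-based evaluation paradigm that adaptively decomposes audio captions into verifiable binary rubric items, builds a bilingual benchmark of 7,920 samples and a 105K-sample training corpus, and trains a dedicated evaluator with SFT and GRPO, reporting significant gains in zero-shot alignment detection and in instruction alignment for downstream reinforcement learning.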

Abstract

The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.

2606.02913 2026-06-03 eess.AS cs.SD (version update)

A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

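Affiliations: Fraunhofer IIS; Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: Compares generative and discriminative deep-learning methods for speech enhancement, analyzing robustness, complexity, and hallucination behavior under high and low SNR and under matched and mismatched training scenarios.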

Abstract

In this study, we conduct a comprehensive comparative analysis of generative and discriminative deep learning-based speech enhancement methods, specifically in noise reduction tasks. Our investigation focuses on evaluating their effectiveness under high and low signal-to-noise ratio conditions, considering both matched and mismatched training scenarios. We further investigate the impact of training data volume and model convergence speed, and interpret the performance differences in terms of objective results for the considered training paradigms. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To further strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights derived from this study provide empirical evidence to assist researchers and practitioners in understanding whether the perceptual gains of different approaches justify their computational cost in practical applications.
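
The abstract measures hallucination partly through word error rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: word-level edit distance between a reference transcript and a recognizer's transcript of the enhanced audio, divided by the reference length. The whitespace tokenization and the example sentences are assumptions for illustration; the paper's ASR system and text normalization are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of three.
wer = word_error_rate("the cat sat", "the bat sat")
```

A generative enhancer that invents plausible but wrong words raises this number even when the output sounds clean, which is why WER is a useful complement to perceptual quality scores.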

2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD (version update)

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

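Affiliations: KAIST

AI summary: Targets speech-vision hallucination in audio-visual large language models. Proposes the SVHalluc benchmark, which evaluates along semantic and temporal dimensions how well models align speech content with visual signals, and finds that existing models are limited in cross-modal understanding.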

Comments Accepted at CVPR 2026

Abstract

Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD (version update)

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

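Affiliations: Yet Another AI

AI summary: Studies whether audio, images, and video can share a unified wavelet token schema. Using a continuous-token model built on Haar DWT/IDWT, it tests the feasibility of a shared schema on several datasets and analyzes the effects of latent capacity and metadata.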

Comments 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
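
The frontend and the energy-based baseline in this abstract can be made concrete with a small 1-D sketch: a one-level orthonormal Haar DWT, its inverse, a selector that keeps only the largest-magnitude coefficients, and PSNR. This is an illustration under stated assumptions, not the paper's code: `keep_largest` is a hypothetical stand-in for magnitude-based selection and is not claimed to match the paper's `energy_global`, and the toy signal and peak value are invented.

```python
import math


def haar_dwt(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt by interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients (hypothetical selector)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect reconstruction."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy piecewise-constant signal: all detail coefficients are zero,
# so keeping half of the coefficients loses nothing.
signal = [0.5, 0.5, 0.2, 0.2, -0.1, -0.1, 0.0, 0.0]
approx, detail = haar_dwt(signal)
coeffs = approx + detail
sparse = keep_largest(coeffs, len(coeffs) // 2)
half = len(approx)
rebuilt = haar_idwt(sparse[:half], sparse[half:])
```

Because the transform is orthonormal, coefficient energy equals signal energy, so dropping the smallest coefficients removes the least energy for a given keep ratio. That is the intuition behind magnitude-based selection beating uniform selection.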

2606.02615 2026-06-03 eess.AS cs.AI cs.SD (version update)

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

Haolong Zheng, Siyin Wang, Xulin Fan, Zengrui Jin, Mark Hasegawa-Johnson

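Affiliations: University of Illinois Urbana-Champaign; Tsinghua University

AI summary: Proposes FSA-GRPO, an RL-based post-training method whose specially designed reward encourages the model to use few-shot demonstrations, strengthening few-shot adaptation and yielding gains on children's speech recognition, speech translation, and audio understanding.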

Abstract

Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.
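
GRPO, which this recipe builds on, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that generic group-relative normalization step. The few-shot-aware reward itself is not specified in the abstract, so the reward values here are hypothetical placeholders.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and are reinforced; a group in which every response earns the same reward yields zero advantage and therefore no learning signal.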

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS (version update)

Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik

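Affiliations: Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics; SpeechTex Ltd.; ELTE Research Centre for Linguistics

AI summary: For lower-resource languages and niche domains, proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles the synthesized utterances into conversations. Experiments show that synthetic conversations improve ASR performance: on a Hungarian benchmark, 67 hours of real conversations plus 636 hours of simulated data outperform a zero-shot model trained on 2,700 hours of speech.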

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03672 2026-06-03 cs.SD cs.MM (version update)

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

Ye Tao, Lupeng Liu, Xuenan Xu, Jiasun Feng, Jiarui Wang, Ying Qin, Shuiyang Mao, Wei Liu, Shuai Wang

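Affiliations: School of Intelligence Science and Technology, Nanjing University; Video Rebirth; Shanghai Jiao Tong University; Beijing Jiaotong University; Shanghai AI Laboratory

AI summary: Proposes Foley-Omni, a unified multimodal audio generation model that jointly models speech, sound effects, and music in a shared latent generation process, moving from isolated task-level synthesis to complete video soundtrack generation, and builds the V2ST-Bench benchmark for holistic evaluation.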

Abstract

Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.

2606.03459 2026-06-03 cs.SD cs.AI (version update)

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

François Pachet

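Affiliations: LIP6, Sorbonne Université, Paris, France; Ynosound, Paris, France

AI summary: Proposes tonal parsimony, which lexicographically minimizes the number of modulations and then the number of distinct tonalities. Combining dynamic programming with the fixed 24-tonality space, it reduces tonal vocabulary in chord-sequence analysis while keeping the modulation count optimal.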

Comments 20 pages, 1 figure

Abstract

We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvisation. Standard dynamic-programming approaches minimize modulations but can introduce unnecessarily many tonal centers. We compare this transition-only objective with pure minimum-vocabulary analysis and with tonal parsimony, which minimizes lexicographically the number of modulations and then the number of distinct tonalities. Although this joint objective is combinatorially hard in general, we give exact algorithms exploiting the fixed 24-tonality major/minor universe. On 31,032 LMD Chords sequences, tonal parsimony preserves the transition optimum while reducing tonal vocabulary in 55.8% of cases. With weighted jazz-substitution closure, it lowers mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. On 1,555 annotated jazz standards, it improves compatible chord-scale agreement to 95.6%, supporting tractable professional-scale harmonic analysis.
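
To make the lexicographic objective concrete, the sketch below first finds the minimum number of modulations with a dynamic program, then searches key subsets of increasing size for the smallest vocabulary that still attains that minimum. It is a brute-force illustration on a toy input, not the exact algorithm from the paper: the chord-to-key compatibility sets are hypothetical, and subset enumeration is only practical for small key universes.

```python
from itertools import combinations

INF = float("inf")


def min_modulations(compat, allowed):
    """Fewest key changes when every chord must take a key from `allowed`."""
    prev = None
    for keys in compat:
        usable = keys & allowed
        if not usable:
            return INF  # some chord has no permitted key
        if prev is None:
            curr = {k: 0 for k in usable}
        else:
            best = min(prev.values())
            # Either stay in key k, or modulate from the cheapest previous key.
            curr = {k: min(prev.get(k, INF), best + 1) for k in usable}
        prev = curr
    return min(prev.values()) if prev else 0


def tonal_parsimony(compat):
    """Return (min modulations, smallest key set that achieves that minimum)."""
    universe = sorted(set().union(*compat))
    target = min_modulations(compat, set(universe))
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            if min_modulations(compat, set(subset)) == target:
                return target, subset
    return target, tuple(universe)


# Hypothetical compatibility sets: chords 2 and 4 fit either F or G.
chords = [{"C"}, {"F", "G"}, {"C"}, {"F", "G"}, {"C"}]
modulations, vocabulary = tonal_parsimony(chords)
```

In the toy sequence every analysis needs four modulations, but a transition-only objective is indifferent between using F once and G once (three keys) and reusing the same key (two keys). The second criterion breaks that tie in favor of the smaller vocabulary.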

2606.03359 2026-06-03 cs.SD cs.CL cs.LG (version update)

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

Daniil Krasnoproshin, Maxim Vashkevich

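Affiliations: Institute of Cybernetics and Machine Learning, Belarusian State University

AI summary: Proposes ResLSTM-SA, a lightweight architecture that integrates residual connections and soft attention into an LSTM. On RAVDESS it reaches a UAR of 0.6517 with 46.8k parameters, outperforming conventional baselines and suiting edge deployment.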

Comments 6 pages, 5 figures, DSPA 2026

Abstract

Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, thereby enabling efficient deployment on edge devices and real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.
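
The headline metric here, unweighted average recall (UAR), is the mean of per-class recalls, so every emotion class counts equally regardless of how many test utterances it has. A minimal sketch of the standard definition follows; the labels in the example are invented for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced case: always predicting "calm" gives 80% accuracy
# but only 0.5 UAR, because the "angry" class is never recalled.
y_true = ["calm", "calm", "calm", "calm", "angry"]
y_pred = ["calm", "calm", "calm", "calm", "calm"]
uar = unweighted_average_recall(y_true, y_pred)
```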

2606.03183 2026-06-03 cs.MM cs.CV cs.SD eess.AS (version update)

Inference-Time Scaling for Joint Audio-Video Generation

Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

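Affiliations: Korea Advanced Institute of Science and Technology; Luma AI

AI summary: To handle multi-objective optimization in joint audio-video generation, proposes a multi-verifier framework and an adaptive reward weighting algorithm that improve semantic alignment, perceptual quality, and audio-video synchronization without additional training.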

Comments Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/

Abstract

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
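
The core difficulty the abstract describes, aggregating verifier rewards that live on different scales, can be seen in a simple best-of-N selector. The sketch below standardizes each verifier's scores across the candidates and sums them with equal, fixed weights. This is a deliberately simplified baseline, not ARW: ARW is described as learning its calibration online, and the verifier names and scores here are hypothetical.

```python
import statistics


def select_best(candidate_scores, eps=1e-8):
    """Pick the candidate with the highest sum of per-verifier z-scores.

    candidate_scores: list of dicts mapping verifier name -> raw score.
    """
    names = list(candidate_scores[0].keys())
    totals = [0.0] * len(candidate_scores)
    for name in names:
        column = [scores[name] for scores in candidate_scores]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for i, value in enumerate(column):
            totals[i] += (value - mean) / (std + eps)
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical verifiers on very different scales.
candidates = [
    {"sync": 0.9, "quality": 10.0},
    {"sync": 0.8, "quality": 90.0},
    {"sync": 0.1, "quality": 95.0},
]
best = select_best(candidates)
raw_best = max(range(len(candidates)), key=lambda i: sum(candidates[i].values()))
```

Summing raw scores lets the large-scale verifier dominate and picks the poorly synchronized candidate, while the standardized sum picks the balanced one. This is the asymmetric trade-off that motivates variance-aware aggregation.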

2606.03169 2026-06-03 cs.SD cs.LG cs.MM (version update)

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang, Jie Zhou

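Affiliations: Pattern Recognition Center, WeChat AI, Tencent Inc.

AI summary: Proposes SketchSong, a hierarchical song generation framework that uses song-level sketch planning and fine-grained multi-track modeling to address incoherent arrangements and coarse modeling of musical parts, outperforming the baseline on objective metrics and human listening tests.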

Abstract

Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post-training for preference optimization such as lyrics and text-prompt alignments, SketchSong achieves competitive results against strong, post-trained open-source systems, demonstrating the effectiveness of our overall design.

2606.03028 2026-06-03 cs.SD (version update)

Audio Spotforming via Post-Filtering Using Cross-Array Non-target Estimates

Yuto Ishikawa, Li Li, Shogo Seki, Kouei Yamaoka

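Affiliations: CyberAgent; University of Tokyo

AI summary: Proposes a post-filtering method that uses cross-array non-target estimates in place of low-rank approximations, improving the extraction of target speech from noisy mixtures.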

Comments Accepted for EUSIPCO 2026

Abstract

Audio spotforming is a technique for extracting target speech from noisy mixtures by utilizing multiple microphone arrays. Conventional methods estimate a shared target speech component from linearly separated signals obtained by each array using low-rank approximations and apply post filtering (PF) based on this estimated low-rank representation. However, owing to the mismatch between low-rank models and the complex structure of speech signals, directly relying on low-rank approximations for PF can degrade the speech extraction performance. In this study, we leverage the observation that non-target components located in the target speech direction from the perspective of one array can be spatially separated when viewed from other arrays. This insight motivates a new spotforming method for efficient post-filter estimation using non-target estimates across arrays instead of relying on low-rank approximations. Experiments demonstrate that the proposed method outperforms conventional spotforming methods.

2606.02980 2026-06-03 cs.SD cs.CY (version update)

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

Sidan Yin, Bo Zhao

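Affiliations: Nanyang Technological University

AI summary: For the ASVspoof 5 Track 1 closed condition, proposes TFPARN, which combines a focal classification loss and a pairwise ranking loss with a Transformer encoder and attention pooling. It outperforms AASIST and RawNet2 on minDCF and EER while using less inference memory and training faster.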

Comments 11 pages, 2 figures

Abstract

Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti-spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training may not give enough attention to hard trials and is not directly aligned with ranking- and threshold-based evaluation metrics. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, uses a Transformer encoder to model frame-level information, applies attention pooling to obtain utterance-level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with re-implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti-spoofing.
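
The two loss terms named in the abstract can be sketched in their common textbook forms: a binary focal loss that down-weights easy trials, and a logistic pairwise loss that pushes every bona fide score above every spoof score. The exact formulations, the focusing parameter, and how TFPARN weights the two terms are not given in the abstract, so everything below (including `gamma=2.0` and the label convention 1 = bona fide, 0 = spoof) is an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(scores, labels, gamma=2.0):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t), averaged over trials."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(scores)


def pairwise_ranking_loss(scores, labels):
    """Logistic loss over all (bona fide, spoof) pairs: log(1 + exp(-(s_pos - s_neg)))."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(math.log1p(math.exp(-(p - n))) for p, n in pairs) / len(pairs)


# Hypothetical scores: the second bona fide trial is scored below a spoof.
scores = [2.0, -0.5, 0.3, -1.5]
labels = [1, 1, 0, 0]
combined = focal_loss(scores, labels) + pairwise_ranking_loss(scores, labels)
```

With `gamma=0` the focal term reduces to ordinary cross-entropy. The pairwise term depends only on score differences, which ties it more directly to ranking- and threshold-based metrics such as EER.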

2606.02739 2026-06-03 cs.SD cs.AI eess.AS (version update)

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, Xuanjing Huang

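Affiliations: Fudan University

AI summary: Proposes EntangleCodec, a unified discrete audio tokenizer that learns joint semantic-acoustic representations by aligning audio with rich captions. It captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes in a compact token stream, reconstructs audio with a flow-matching diffusion decoder, and reports leading performance on both audio understanding and generation tasks.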

Comments 17 pages, 10 figures

Abstract

Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks using 22× fewer parameters; scaling to 8B further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.

2606.02679 2026-06-03 cs.LG cs.MM cs.SD eess.AS 版本更新

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

融合之前,先问保留什么:多模态信号的上下文校准

Jiyuan Liu, Liangwei Nathan Zheng, Wei Emma Zhang, Xinpei Wang, Weitong Chen

发表机构 * Adelaide University(阿德莱德大学) Shandong University(山东大学)

AI总结 提出一种紧凑的校准模块,在融合前对各模态特征进行实例级和维度级调制,抑制不可靠成分并增强上下文支持信号,提升多模态任务性能。

Comments 11 pages, 7 figures, 9 tables

详情
AI中文摘要

多模态系统通常受益于跨语言、声音和视觉流的信息组合,但这种收益并非保证。一个模态对某个输入有用,可能对另一个输入成为干扰,同一模态内的局部特征响应可能与其他来源的证据不一致。本文研究如何在下游预测器合并多模态表示之前调整它们。我们开发了一个紧凑的校准模块,在摘要级别将每个模态与其他模态进行比较,提取跨源支持和冲突的线索,并将这些线索转换为实例级和维度级的调制信号。校准应用于原始模态特征而非已融合的表示,使模型能够抑制误导成分,保留微弱但有用的证据,并强调在当前多模态上下文中得到更好支持的响应。该模块设计为即插即用组件,可附加到不同的融合主干上,无需更改其预测头。在涵盖情感理解、动作识别、音视频事件检测和音视频情感分类的五个基准测试中,所提出的预融合校准策略在基于序列和卷积的融合设置下均提升了性能。模态移除、合成损坏、训练动态和特征级可视化的额外分析表明,在融合前校准信号可以减少来自不可靠模态的干扰,并产生更稳定的多模态优化。

英文摘要

Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within the same modality can disagree with evidence from other sources. This work investigates how to adjust multimodal representations before they are merged by a downstream predictor. We develop a compact calibration module that compares each modality with the others at the summary level, extracts cues of cross-source support and conflict, and converts these cues into instance-wise and dimension-wise modulation signals. The calibration is applied to the original modality features rather than to already fused representations, enabling the model to suppress misleading components, preserve weak but useful evidence, and emphasize responses that are better supported by the current multimodal context. The module is designed as a plug-in component and can be attached to different fusion backbones without changing their prediction heads. Across five benchmarks covering sentiment understanding, action recognition, audio-visual event detection, and audio-visual emotion classification, the proposed pre-combination calibration strategy improves performance under both sequence-based and convolutional fusion settings. Additional analyses under modality removal, synthetic corruption, training dynamics, and feature-level visualization show that calibrating signals before fusion can reduce interference from unreliable modalities and produce more stable multimodal optimization.

2606.02638 2026-06-03 cs.SD cs.AI eess.AS 版本更新

SegTune: Structured and Fine-Grained Control for Song Generation

SegTune:歌曲生成的结构化与细粒度控制

Yuejiao Wang, Zihao Ji, Pengfei Cai, Xu Li, Haorui Zheng, Zewen Song, Zhongliang Liu, Chen Zhang, Pengfei Wan

发表机构 * Kling Team, Kuaishou Technology(快手科技 Kling 团队) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学)

AI总结 提出基于扩散Transformer的SegTune框架,通过用户或LLM指定局部音乐描述实现结构化细粒度控制,并引入LLM时长预测器实现精确歌词-音乐对齐,在音乐性和可控性上超越现有基线。

Comments This paper has been accepted to ACL 2026 as an oral presentation and has been nominated for the Best Paper Award. This work is a revised and extended version of an earlier technical report (arXiv:2510.18416). arXiv admin note: text overlap with arXiv:2510.18416

详情
AI中文摘要

近期神经歌曲生成的进展使得从歌词和全局文本提示中实现高质量合成成为可能。然而,大多数系统无法建模歌曲随时间变化的属性,严重限制了对音乐结构和动态的细粒度控制。为解决这一问题,我们提出SegTune,一个基于扩散Transformer的框架,通过允许用户或大型语言模型(LLM)指定与歌曲片段对齐的局部音乐描述,实现结构化和细粒度的可控性。这些片段提示在时间上被广播到对应的时间窗口,而全局提示则确保风格连贯性。为支持精确的歌词-音乐对齐,我们引入了一个基于LLM的时长预测器,以LyRiCs格式自回归生成句子级时间戳。我们进一步构建了一个大规模数据管道,用于收集高质量歌曲及其对齐的歌词和提示,并提出了新的指标来评估片段对齐和声乐一致性。实验表明,SegTune在音乐性和可控性方面均优于现有基线。访问我们的项目页面(https://github.com/KlingAIResearch/SegTune)获取代码和更多生成的歌曲。

英文摘要

Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.

2605.31530 2026-06-03 eess.AS cs.SD 版本更新

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

UNISON:通过深度LLM融合的统一声音生成与编辑框架

Zhaoqing Li, Haoning Xu, Jingran Su, Yaofang Liu, Zhefan Rao, Huimeng Wang, Jiajun Deng, Tianzi Wang, Zengrui Jin, Rui Liu, Haoxuan Che, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) The Hong Kong Polytechnic University(香港理工大学) City University of Hong Kong(香港城市大学) The Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Huawei Research Hong Kong(华为香港研究所)

AI总结 提出UNISON,一个基于潜在扩散的统一框架,通过层间深度LLM融合和多任务架构,实现语音生成、声音生成和音频编辑,在多个任务上达到或超越专业模型性能,且参数量减少约4倍。

详情
AI中文摘要

我们提出UNISON,一个潜在扩散框架,将语音生成、声音生成和音频编辑统一在单个模型中。单个模型处理文本到音频、文本到语音、零样本说话人克隆、混合语音与声音生成、场景级音频编辑、场景中语音编辑以及定时时间组合,所有这些任务共享一组权重。我们的架构具有两个核心设计:(1) 层间深度LLM融合,通过学习的投影将来自冻结MLLM均匀采样层的隐藏状态注入对应的MM-DiT块,提供深度匹配的语义条件,改善指令遵循能力,优于单层基线;(2) 统一的多任务架构,其中任务身份仅由通道掩码编码,源音频通过VAE编码的通道拼接提供。训练通过在线GPU端多任务数据合成流水线(具有任务同质批处理和两阶段课程)稳定进行。拥有621M至732M可训练参数,UNISON在评估的各个领域取得了与任务专业模型竞争或超越的结果,同时比类似统一系统小约4倍。

英文摘要

We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.

2601.22599 2026-06-03 cs.SD cs.HC 版本更新

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

用于数据高效查询式通用声音分离的语义一致数据集

Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu

发表机构 * Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing, China(清华大学计算机科学与技术系、人工智能研究院、BNRist,中国北京) Shanda AI Research Tokyo(盛大AI研究院东京) IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China(清华大学IDG/麦戈文脑科学研究院,中国北京) Johns Hopkins University(约翰霍普金斯大学) Chinese Institute for Brain Research (CIBR), Beijing, China(北京脑科学与类脑研究所(CIBR),中国北京)

AI总结 提出自动管道通过语义一致合成协议消除事件共现,构建高质量合成数据集Hive,使模型在数据量极小的情况下达到与大规模训练模型相当的分离精度和泛化能力。

Comments Accepted to ICML 2026

详情
AI中文摘要

查询式通用声音分离是智能听觉系统的基础,旨在从混合声音中分离特定声源。尽管最近取得了进展,现有方法在复杂声学场景中仍存在残余干扰。这种性能限制主要源于数据瓶颈:野外数据集包含弱标签和严重的事件共现。这些缺陷导致模型学习背景噪声与目标类别之间的虚假相关性,而非鲁棒的声学特征。为解决这一问题,我们提出了一种自动管道,通过语义一致合成协议从野外数据集中挖掘高纯度单事件片段,从而消除事件共现。利用该管道,我们构建了Hive,一个包含2400小时原始音频的高质量合成数据集。实验结果表明,与在比Hive大约500倍的大数据集上训练的最先进模型SAM-Audio相比,在Hive上训练的某些开源模型达到了具有竞争力的分离精度和感知质量。此外,这些模型在分布外评估基准上表现出显著的零样本泛化能力。这些发现强调,优先考虑监督信号的纯度可以实现显著的数据效率,为以降低计算成本训练鲁棒的听觉基础模型提供了新范式。代码和数据集可在https://cslikai.cn/Hive获取。

英文摘要

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio which was trained on a huge dataset $\sim$500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibited remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://cslikai.cn/Hive.

2509.09685 2026-06-03 cs.IR cs.AI cs.MM cs.SD eess.AS 版本更新

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

TalkPlayData 2:用于多模态对话式音乐推荐的智能体合成数据流水线

Keunwoo Choi, Seungheon Doh, Juhan Nam

发表机构 * KAIST(韩国科学技术院)

AI总结 提出TalkPlayData 2,一个由智能体数据流水线生成的多模态对话式音乐推荐合成数据集,通过多角色大语言模型模拟对话并覆盖多种场景,以支持生成式推荐模型训练。

详情
AI中文摘要

我们提出了TalkPlayData 2,一个由智能体数据流水线生成的多模态对话式音乐推荐合成数据集。在该流水线中,多个大语言模型(LLM)智能体被创建,承担不同角色,具有专门的提示词和访问不同信息部分的权限,通过记录Listener LLM和Recsys LLM之间的对话来获取聊天数据。为了覆盖各种对话场景,每个对话的Listener LLM基于微调的对话目标进行条件设置。最后,所有LLM都是多模态的,支持音频和图像,从而模拟多模态推荐和对话。在LLM-as-a-judge和主观评估实验中,TalkPlayData 2在训练音乐生成式推荐模型的各个方面达到了预期目标。TalkPlayData 2及其生成代码已在https://talkpl-ai.github.io发布。

英文摘要

We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl-ai.github.io.

2510.01698 2026-06-03 cs.IR cs.MM cs.SD eess.AS 版本更新

TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling

TalkPlay-Tools:基于大语言模型工具调用的对话式音乐推荐

Seungheon Doh, Keunwoo Choi, Juhan Nam

发表机构 * KAIST(韩国科学技术院) talkpl.ai

AI总结 提出一种基于LLM工具调用的统一检索-重排序流水线,通过布尔过滤、稀疏检索、稠密检索和生成式检索的组合,实现端到端的对话式音乐推荐。

Comments Accepted for publication at The Workshop on AI for Music, Neural Information Processing Systems (NeurIPS-AI4Music)

详情
AI中文摘要

尽管大型语言模型(LLM)的最新进展已成功实现了具有自然语言交互的生成式推荐系统,但其推荐行为受限,导致系统中其他更简单但关键组件(如元数据或属性过滤)未被充分利用。我们提出了一种基于LLM的音乐推荐系统,通过工具调用作为统一的检索-重排序流水线。该系统将LLM定位为端到端推荐系统,解释用户意图、规划工具调用并编排专门组件:布尔过滤(SQL)、稀疏检索(BM25)、稠密检索(嵌入相似度)和生成式检索(语义ID)。通过工具规划,系统预测要使用的工具类型、执行顺序以及查找匹配用户偏好的音乐所需的参数,支持多种模态,同时无缝集成多个数据库过滤方法。我们证明,这种统一的工具调用框架通过基于用户查询选择性地采用适当的检索方法,在多种推荐场景中实现了有竞争力的性能,为对话式音乐推荐系统设想了新的范式。

英文摘要

While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.

2502.13713 2026-06-03 cs.IR cs.SD eess.AS 版本更新

TALKPLAY: Multimodal Music Recommendation with Large Language Models

TALKPLAY:基于大语言模型的多模态音乐推荐

Seungheon Doh, Keunwoo Choi, Juhan Nam

发表机构 * KAIST(韩国科学技术院) talkpl.ai

AI总结 提出TALKPLAY系统,通过将推荐转化为token生成问题,利用大语言模型处理多模态音乐特征,实现端到端对话式推荐,显著优于单模态方法。

详情
AI中文摘要

我们提出TALKPLAY,一种新颖的多模态音乐推荐系统,它将推荐重新表述为使用大语言模型(LLM)的token生成问题。通过利用LLM的指令遵循和自然语言生成能力,我们的系统能够从多样化的用户查询中有效推荐音乐,同时生成上下文相关的响应。虽然预训练的LLM主要设计用于文本模态,但TALKPLAY通过两个关键创新扩展了其范围:一个多模态音乐分词器,用于编码音频特征、歌词、元数据、语义标签和播放列表共现信号;以及一个词汇扩展机制,能够统一处理和生成语言和音乐相关的token。通过将推荐系统直接集成到LLM架构中,TALKPLAY通过以下方式改造传统系统:(1)将先前的两阶段对话推荐系统(推荐引擎和对话管理器)统一为连贯的端到端系统,(2)有效利用长对话上下文进行推荐,同时在扩展的多轮交互中保持强劲性能,以及(3)生成自然语言响应以实现无缝的用户交互。我们的定性和定量评估表明,TALKPLAY在推荐性能和对话自然度方面显著优于仅基于文本或收听历史的单模态方法。

英文摘要

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.

2412.05123 2026-06-03 cs.SD eess.AS 版本更新

Differentiable Optimization of Linear Differential Microphone Arrays: A Joint Geometry and Filter Design Framework

线性差分麦克风阵列的可微优化:联合几何与滤波器设计框架

Siminfar Samakoush Galougah, Ramani Duraiswami

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出一种可微优化框架,通过联合优化麦克风位置和滤波器权重,实现线性差分麦克风阵列的最优波束模式,在保证无失真约束的同时兼顾指向性、鲁棒性和硬件效率。

Comments 5 pages, 4 figures, 2 tables

详情
AI中文摘要

本文提出了一种用于约束线性差分麦克风阵列(LDMA)设计的可微优化框架。该方法采用非均匀延迟求和波束形成器作为轻量级基础系统模型,通过联合优化麦克风位置和滤波器权重,证明了其能够实现LDMA的最优波束模式。该公式能够在期望声方向实现无失真约束的滤波器优化设计,同时对麦克风定位施加约束以确保一致性能。通过多个指标的评估,包括均方误差(MSE)、指向性指数(DI)、白噪声增益(WNG)和计算时间,并与最先进方法进行比较,该方法展示了一种灵活、指向性强、鲁棒且硬件高效的设计。

英文摘要

This paper presents a differentiable optimization framework for the design of constrained Linear Differential Microphone Arrays (LDMAs). The proposed method leverages a non-uniform delay-and-sum beamformer as a light-weight base system model, proving its ability to achieve the optimal beampattern of LDMAs by jointly optimizing microphone positions and filter weights. The formulation enables the optimized design of a filter with a distortion-free constraint in the desired sound direction, while also imposing constraints on microphone positioning to ensure consistent performance. Through evaluation on multiple metrics, including Mean Squared Error (MSE), Directivity Index (DI), White Noise Gain (WNG), and computation time, and comparison with state-of-the-art methods, this approach demonstrates a flexible, directive, robust, and hardware-efficient design.
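The delay-and-sum model named in this abstract can be illustrated with a minimal sketch. It evaluates the beampattern of a non-uniform linear array and checks the distortion-free condition in the look direction; it does not reproduce the paper's joint gradient-based optimization. The microphone positions, the 2 kHz frequency, the endfire look direction, and all function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_vector(positions, theta, freq, c=SPEED_OF_SOUND):
    """Far-field steering vector for microphones placed on a line.

    positions: microphone coordinates along the array axis (metres).
    theta: arrival angle measured from the array axis (radians).
    """
    delays = positions * np.cos(theta) / c
    return np.exp(-2j * np.pi * freq * delays)


def das_weights(positions, theta0, freq):
    """Delay-and-sum weights steered to theta0 (hypothetical helper)."""
    return steering_vector(positions, theta0, freq) / len(positions)


def beampattern(weights, positions, thetas, freq):
    """Array response w^H d(theta) for each angle in thetas."""
    return np.array(
        [np.vdot(weights, steering_vector(positions, t, freq)) for t in thetas]
    )


# Illustrative non-uniform geometry (assumed, not from the paper).
positions = np.array([0.0, 0.010, 0.025, 0.045])
freq = 2000.0
theta0 = 0.0  # endfire look direction
thetas = np.linspace(0.0, np.pi, 181)

weights = das_weights(positions, theta0, freq)
response = beampattern(weights, positions, thetas, freq)
```

In a differentiable setting, `positions` and `weights` would become trainable parameters and `response` would be compared against a target beampattern, for example through a mean squared error loss.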

0804.4347 2026-06-03 math.NA cs.NA cs.SD math.FA 版本更新

Nonorthogonal Bases and Phase Decomposition: Properties and Applications

非正交基与相位分解:性质与应用

Sossio Vergara

AI总结 本文提出一种基于单一函数的迭代分析方法,将傅里叶定理的极坐标形式推广到非正交基,并展示了其在函数分析与重构、噪声抑制等方面的应用。

Comments 11 pages

详情
Journal ref Digital Signal Processing (2014), pp. 223-230
AI中文摘要

在之前的论文[1]中,讨论了使用一对通用函数作为基进行泛函分析的可行性,以及向量分解。本文通过利用其中开发的一种分析方法(应用于相位坐标)完善了这一范式,因此只需要一个函数作为基。我们将证明,得益于新颖的迭代分析,任何满足相当宽松条件的函数在本质上都是一个基。这进而将傅里叶定理的极坐标形式推广到一大类非正交基。这种推广的主要优势在于它继承了原始傅里叶定理的一些性质。因此,新变换具有广泛的应用和一些显著的结果。我们将新工具与小波和框架进行比较。将给出使用所开发算法和通用基进行函数分析与重构的示例。将讨论一些可以迅速受益于该理论的性质和应用。将使用用于噪声抑制的匹配滤波器的实现作为该理论潜力的示例。

英文摘要

In a previous paper [1] it was discussed the viability of functional analysis using as a basis a couple of generic functions, and hence vectorial decomposition. Here we complete the paradigm exploiting one of the analysis methodologies developed there, but applied to phase coordinates, so needing only one function as a basis. It will be shown that, thanks to the novel iterative analysis, any function satisfying a rather loose requisite is ontologically a basis. This in turn generalizes the polar version of the Fourier theorem to an ample class of nonorthogonal bases. The main advantage of this generalization is that it inherits some of the properties of the original Fourier theorem. As a result the new transform has a wide range of applications and some remarkable consequences. The new tool will be compared with wavelets and frames. Examples of analysis and reconstruction of functions using the developed algorithms and generic bases will be given. Some of the properties, and applications that can promptly benefit from the theory, will be discussed. The implementation of a matched filter for noise suppression will be used as an example of the potential of the theory.

0906.5202 2026-06-03 math.NA cs.NA cs.SD 版本更新

Superposition frames for adaptive time-frequency analysis and fast reconstruction

用于自适应时频分析与快速重构的叠加框架

Daniel Rudoy, Prabahan Basu, Patrick J. Wolfe

AI总结 本文提出一类称为叠加框架的自适应线性时频表示,具有类似短时傅里叶变换的快速重叠相加重构特性,并通过确定性及随机信号自适应准则实现数值稳定可逆表示。

Comments 16 pages, 6 figures; revised version

详情
Journal ref IEEE Transactions on Signal Processing, vol. 58, pp. 2581-2596, 2010
AI中文摘要

本文介绍了一类广泛的自适应线性时频表示,称为叠加框架,并证明它们具有类似标准短时傅里叶技术的快速重叠相加重构特性。这一方法与现有文献中的许多自适应时频表示形成对比,后者虽然比标准固定分辨率方法更灵活,但通常无法提供高效重构,且往往缺乏精确框架理论分析所需的规则结构。我们的主要技术贡献在于开发了确保该构造提供数值稳定、可逆信号表示的性质。我们的主要算法贡献在于基于时频集中性和非平稳性检测,分别在确定性和随机设置中引入并讨论了特定的信号自适应准则。最后,我们通过一个简短的语音增强示例来突出我们方法的潜在应用。

英文摘要

In this article we introduce a broad family of adaptive, linear time-frequency representations termed superposition frames, and show that they admit desirable fast overlap-add reconstruction properties akin to standard short-time Fourier techniques. This approach stands in contrast to many adaptive time-frequency representations in the extant literature, which, while more flexible than standard fixed-resolution approaches, typically fail to provide efficient reconstruction and often lack the regular structure necessary for precise frame-theoretic analysis. Our main technical contributions come through the development of properties which ensure that this construction provides for a numerically stable, invertible signal representation. Our primary algorithmic contributions come via the introduction and discussion of specific signal adaptation criteria in deterministic and stochastic settings, based respectively on time-frequency concentration and nonstationarity detection. We conclude with a short speech enhancement example that serves to highlight potential applications of our approach.
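For reference, the standard short-time Fourier overlap-add reconstruction that this abstract uses as its point of comparison can be sketched as follows. This is the fixed-resolution baseline only, not the superposition-frame construction; the frame length, hop size, periodic Hann window, and function names are assumptions chosen for illustration.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Windowed FFT frames of x, zero-padded by frame_len on both ends."""
    window = np.hanning(frame_len + 1)[:-1]  # periodic Hann window
    padded = np.pad(x, frame_len)
    starts = range(0, len(padded) - frame_len + 1, hop)
    frames = np.stack([padded[s:s + frame_len] * window for s in starts])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_samples, frame_len=256, hop=128):
    """Overlap-add the inverse FFT frames and undo the window weighting."""
    window = np.hanning(frame_len + 1)[:-1]
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(n_samples + 2 * frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + frame_len] += frame
        norm[start:start + frame_len] += window
    covered = norm > 1e-8  # skip samples no window reaches
    out[covered] /= norm[covered]
    return out[frame_len:frame_len + n_samples]  # drop the padding


rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
reconstructed = istft(stft(signal), len(signal))
```

With a periodic Hann window at 50% overlap the shifted windows sum to one, so the normalization step only matters near the padded edges. An adaptive scheme would vary the window from frame to frame while keeping this overlap-add structure.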

1003.2441 2026-06-03 cs.SD cs.NA math.NA 版本更新

Up-sampling and Natural Sample Value Computation for Digital Pulse Width Modulators

数字脉冲宽度调制器的上采样与自然采样值计算

Kien C. Nguyen, Dilip V. Sarwate

AI总结 提出一种结合上采样、数字插值和自然采样转换的新方法,通过多相数字插值滤波器和数字微分器实现,以降低数字脉冲宽度调制中的谐波失真。

详情
AI中文摘要

数字脉冲宽度调制已被考虑用于高保真和高效率的音频放大器多年。研究表明,如果开关频率远高于调制波形的奈奎斯特率,则可以减少失真并简化系统实现。因此,输入数字源通常被上采样到更高的频率。同时,也证明了将均匀样本转换为自然样本会降低谐波失真。因此,在本文中,我们研究了一种结合上采样、数字插值和自然采样转换的新方法。该方法使用数字插值滤波器和数字微分器的多相实现。我们将展示该结构由一个FIR型线性级和一个非线性级组成。还将展示基于该方法的脉冲宽度调制系统的一些频谱仿真结果。最后,我们将讨论新方法相对于旧算法的改进。

英文摘要

Digital pulse width modulation has been considered for high-fidelity and high-efficiency audio amplifiers for several years. It has been shown that the distortion can be reduced and the implementation of the system can be simplified if the switching frequency is much higher than the Nyquist rate of the modulating waveform. Hence, the input digital source is normally upsampled to a higher frequency. It was also proved that converting uniform samples to natural samples will decrease the harmonic distortion. Thus, in this paper, we examine a new approach that combines upsampling, digital interpolation and natural sampling conversion. This approach uses poly-phase implementation of the digital interpolation filter and digital differentiators. We will show that the structure consists of an FIR-type linear stage and a nonlinear stage. Some spectral simulation results of a pulse width modulation system based on this approach will also be presented. Finally, we will discuss the improvement of the new approach over old algorithms.