arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06550 2026-06-08 cs.SD cs.AI eess.AS New submission

Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

Shuanglin Li, Ruxiao Qian, Siyang Song

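Affiliations: Xiangjiang Laboratory; University of Exeter

AI summary: First-order aggregation in self-supervised speech emotion recognition ignores feature correlations and Riemannian geometry. The paper proposes a second-order correlation layer that captures co-occurrence patterns with covariance descriptors and preserves geometric integrity through Log-Euclidean mapping, recovering discriminative information on the ESD and RAVDESS datasets.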

Abstract

Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.
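
The covariance-plus-LEM pooling described above can be sketched in a few lines of NumPy. This is only an illustration of the general technique, not the paper's SOC layer: the function name, the `eps` regulariser, and the random stand-in features are all assumptions.

```python
import numpy as np

def log_euclidean_descriptor(frames: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Pool a (T, D) frame sequence into a tangent-space vector (hypothetical helper)."""
    centered = frames - frames.mean(axis=0, keepdims=True)
    # Sample covariance of the D feature channels over T frames.
    cov = centered.T @ centered / max(frames.shape[0] - 1, 1)
    # Regularise so the matrix is strictly positive definite (eps is an assumed value).
    cov += eps * np.eye(cov.shape[0])
    # Matrix logarithm via eigendecomposition: log(C) = U diag(log w) U^T.
    w, u = np.linalg.eigh(cov)
    log_cov = (u * np.log(w)) @ u.T
    # Keep the upper triangle; scale off-diagonals by sqrt(2) to preserve the Frobenius norm.
    iu = np.triu_indices(log_cov.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return log_cov[iu] * scale

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 8))   # stand-in for SSL frame features
desc = log_euclidean_descriptor(feats)
print(desc.shape)  # (36,)
```

The resulting vector lives in a flat space, so it can be fed to an ordinary linear classifier, which is the property the abstract attributes to the Log-Euclidean mapping.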

2606.06559 2026-06-08 cs.SD cs.AI eess.AS New submission

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

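Affiliations: The Chinese University of Hong Kong; AudioLab Hong Kong, Huawei Leibniz Research Center; Nanyang Technological University

AI summary: Proposes the IRAF module, which predicts a frame-level reliability gate to modulate the contribution of user audio to the LLM, improving the response quality and interaction stability of full-duplex dialogue systems in the presence of interfering speakers.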

Abstract

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
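
A frame-wise scalar gate of the kind described above might look like the following sketch. The abstract does not specify the gate network or the fusion operator, so the single linear layer with a sigmoid, the additive fusion, and every name here are assumptions made for illustration.

```python
import numpy as np

def gated_fusion(user, agent, speaker, w, b):
    """user, agent: (T, D) frame embeddings; speaker: (D,) target-speaker embedding."""
    num_frames = user.shape[0]
    # Concatenate each user frame with the (repeated) target-speaker embedding.
    gate_in = np.concatenate([user, np.tile(speaker, (num_frames, 1))], axis=1)  # (T, 2D)
    # One scalar reliability gate per frame, squashed into (0, 1).
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w + b)))                              # (T,)
    # Rescale the user stream, then fuse with the agent stream (additive fusion assumed).
    return gate[:, None] * user + agent, gate

rng = np.random.default_rng(0)
T, D = 5, 4
user, agent = rng.standard_normal((T, D)), rng.standard_normal((T, D))
speaker = rng.standard_normal(D)
w, b = rng.standard_normal(2 * D), 0.0   # untrained, random gate parameters
fused, gate = gated_fusion(user, agent, speaker, w, b)
print(fused.shape, gate.shape)  # (5, 4) (5,)
```

Because the gate depends only on the current frame and a fixed speaker embedding, a module shaped like this can run frame by frame, which is consistent with the streaming-compatible design the abstract mentions.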

2606.06615 2026-06-08 cs.SD cs.AI cs.LG eess.AS New submission

FIGMA: Towards FIne-Grained Music retrievAl

Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha, Ramani Duraiswami

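Affiliations: University of Maryland, College Park

AI summary: Existing music retrieval models cannot handle queries about fine-grained attributes. FIGMA, a multi-view contrastive architecture, jointly optimizes global audio-text alignment and frame-level, token-wise alignment to capture both high-level semantics and fine-grained musical attributes in a unified representation space, and achieves substantial gains on the newly constructed fine-grained music caption dataset.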

Comments Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/

Abstract

Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, and beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.

2606.06740 2026-06-08 cs.SD cs.AI cs.CL New submission

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

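Affiliations: National Institute of Technology, Trichy; Indian Institute of Technology, Madras

AI summary: Analyzes a BigVGAN-based unit vocoder for multilingual multi-speaker speech generation and finds that cluster size governs intelligibility, explicit speaker conditioning prevents identity collapse, and language supervision helps at small cluster sizes.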

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract

Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger inventories progressively separating them.

2606.06743 2026-06-08 cs.SD cs.AI cs.CL New submission

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

Arjun Gangwar, S Umesh

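Affiliations: Indian Institute of Technology, Madras

AI summary: Proposes HybridCodec, a unified neural audio codec that combines semantic distillation with a dual-stream architecture, achieving strong disentanglement, cross-lingual robustness, and a 3x speedup.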

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

Abstract

The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introducing semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

2606.06806 2026-06-08 cs.SD eess.AS New submission

Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu

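Affiliations: The University of Tokyo; National Institute of Advanced Industrial Science and Technology (AIST)

AI summary: Proposes using soft token assignment at downstream inference time, which keeps the training efficiency of hard discretization while making tokens more expressive at inference. The method outperforms hard assignment on ASR and speech synthesis and surpasses continuous SSL features on non-native ASR.

Comments Accepted to Interspeech 2026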

Abstract

Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.
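
The contrast between hard and soft assignment can be illustrated with a small NumPy sketch. The abstract does not say how the soft distribution is computed, so the softmax over negative squared distances to k-means centroids, the `temperature` parameter, and the token embedding table are all assumptions.

```python
import numpy as np

def squared_distances(feats, centroids):
    # Pairwise squared Euclidean distances between frames and centroids: (T, K).
    return ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)

def hard_assign(feats, centroids):
    # Conventional discretisation: index of the nearest centroid per frame.
    return squared_distances(feats, centroids).argmin(axis=1)

def soft_assign(feats, centroids, temperature=1.0):
    # Assumed soft variant: softmax over negative squared distances.
    logits = -squared_distances(feats, centroids) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
centroids = rng.standard_normal((4, 3))   # K=4 units in a 3-dim feature space
table = rng.standard_normal((4, 6))       # hypothetical downstream token embeddings
feats = rng.standard_normal((10, 3))

hard = hard_assign(feats, centroids)
probs = soft_assign(feats, centroids, temperature=0.5)
hard_in = table[hard]      # what a model trained on hard tokens sees
soft_in = probs @ table    # probability-weighted mix of the same embeddings
print(probs.shape, soft_in.shape)  # (10, 4) (10, 6)
```

In this sketch the downstream embedding table is untouched; only the input at inference changes from a one-hot lookup to a weighted mixture, and a very small temperature recovers the hard assignment.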

2606.06928 2026-06-08 cs.SD eess.AS New submission

VoxCPM2 Technical Report

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Jiancheng Gui, Jiaheng Wu, Ziyang Wang, Xudong Shen, Runchuan Ye, Zhisheng Zhang, Jiuyang Zhou, Bingsong Bai, Weiyue Sun, Mengyuan Deng, Qundong Shi, Zhiyong Wu, Zhiyuan Liu

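Affiliations: VoxCPM Team

AI summary: Presents VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model. With hierarchical diffusion-autoregressive modeling, an asymmetric AudioVAE, and scaling to 2B parameters and over 2 million hours of data, it reaches state-of-the-art or competitive results on public zero-shot and instruction-following TTS benchmarks, and an average WER of 1.68% on an internal 30-language evaluation set.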

Comments The technical report of VoxCPM2, a TTS foundation model (GitHub: https://github.com/OpenBMB/VoxCPM)

Abstract

We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.

2606.06975 2026-06-08 cs.SD eess.AS New submission

MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

Affiliations: Faculty of Computer Science and Information Technology, Universiti Malaya; Faculty of Electrical Engineering, Universiti Teknologi Malaysia

AI summary: Presents the MyGardenBird dataset of 7,200 manually validated audio clips of 12 common Malaysian bird species sourced from Xeno-canto. Baseline convolutional neural network experiments reach 92-96% classification accuracy.

Comments 17 pages, 9 figures

Abstract

Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83-59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92-96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
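
Partitioning at the source-recording level means every clip cut from one recording lands in the same split. The sketch below shows one way to do that, with a sanity check of the stated totals. It is not the dataset's released preprocessing code: the function name, the 20% test fraction, and the toy clip and recording IDs are assumptions.

```python
import random
from collections import defaultdict

def split_by_recording(clips, test_fraction=0.2, seed=0):
    """clips: list of (clip_id, recording_id). Returns (train, test) lists of clip ids."""
    by_rec = defaultdict(list)
    for clip_id, rec_id in clips:
        by_rec[rec_id].append(clip_id)
    # Shuffle recordings, not clips, so no recording straddles the two splits.
    rec_ids = sorted(by_rec)
    random.Random(seed).shuffle(rec_ids)
    n_test = max(1, round(len(rec_ids) * test_fraction))
    test_recs = set(rec_ids[:n_test])
    train = [c for r in rec_ids if r not in test_recs for c in by_rec[r]]
    test = [c for r in rec_ids if r in test_recs for c in by_rec[r]]
    return train, test

# Toy data: 30 clips cut from 10 recordings (3 clips each).
clips = [(f"clip{i}", f"rec{i // 3}") for i in range(30)]
train, test = split_by_recording(clips)
print(len(train), len(test))  # 24 6

# Sanity check of the stated totals: 12 species x 600 clips x 3 s each.
n_clips = 12 * 600
hours = n_clips * 3 / 3600
print(n_clips, hours)  # 7200 6.0
```

A clip-level random split would instead let near-identical segments of one recording appear in both train and test, which is the leakage the abstract says the partitioning is meant to mitigate.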

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG New submission

Phonetic Error Analysis of Raw Waveform Acoustic Models

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

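Affiliations: Centre for Speech Technology Research (CSTR), University of Edinburgh, UK; Cisco, UK; SLAI & CUHK-SZ, China; King's College London, UK

AI summary: By decomposing the phone error rate and analyzing confusion matrices, the paper finds that BLSTM layers benefit transition-dependent classes most, that WSJ transfer learning improves consonants roughly three times more than vowels, and that confusion patterns reflect inherent phonetic similarities.

Comments INTERSPEECH 2026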

Abstract

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.

2606.07080 2026-06-08 cs.SD cs.AI eess.AS New submission

dots.tts Technical Report

Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

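Affiliations: ByteDance

AI summary: Presents a 2B-parameter continuous autoregressive TTS foundation model. With a multi-objective AudioVAE, full-history-conditioned flow matching, and reward-free self-corrective post-training, it achieves the best average performance on Seed-TTS-Eval and supports low-latency inference.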

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

2606.07207 2026-06-08 cs.SD cs.LG eess.AS New submission

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

Zixi Li, Youzhen Li

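Affiliations: Sun Yat-sen University; Datawhale

AI summary: Proposes the Eisbach log-barrier, which weights the loss by the entropy of the DiT output's spatial energy distribution. By scaling the gradient step size during supervised diffusion training, it promotes thematic development, acoustic differentiation, and textural diversity in music while avoiding mode collapse.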

Abstract

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, the opposite of mode collapse. This works because in supervised diffusion the gradient direction is locked to ground truth, so confidence only scales the step size, and because temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass, with analyzed noise-level dynamics and testable predictions.

2606.07210 2026-06-08 cs.SD cs.CR New submission

A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization

Orane Dufour, Paul Magron, Mickael Rouvier, Emmanuel Vincent

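Affiliations: Université de Lorraine, CNRS, Inria, LORIA; LIA, Avignon University

AI summary: A large-scale per-speaker analysis shows that re-identification risk under speech anonymization varies widely across individuals and is jointly determined by the attacker, the anonymizer, and the amount of available speech, challenging the notion of intrinsic speaker-level privacy risk.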

Comments Accepted to Interspeech

Abstract

Speech anonymization is commonly evaluated using average-case metrics such as the equal error rate, which can hide large disparities in re-identification risks across individuals. In this paper, we conduct a large-scale per-speaker privacy analysis using a linkability-based metric under a worst-case scenario. Nearly 5,000 speakers are evaluated across multiple anonymization systems, attacker architectures, and conversation lengths. While linkability scores are highly polarized at the speaker level, the sets of easy-to-re-identify and hard-to-re-identify speakers vary substantially across configurations. We show that no single factor explains speaker vulnerability. Instead, the re-identification risk emerges from the interaction between the attacker, the anonymizer, and the amount of available speech. These results challenge the notion of intrinsic speaker-level privacy risks and emphasize the need for evaluation protocols that are explicitly conditioned on the attacker and anonymizer.
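
For context, the equal error rate that the abstract calls an average-case metric can be computed with a simple threshold sweep, as sketched below. This covers only the pooled EER, not the paper's linkability-based per-speaker metric; the function name and the toy scores are assumptions.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER of a verification system from same-speaker and different-speaker trial scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1]))  # 0.25
```

Since all trials are pooled before the sweep, a single number like this cannot show whether errors are spread evenly or concentrated on a few speakers, which is the gap a per-speaker analysis addresses.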

2606.07229 2026-06-08 cs.SD cs.CL cs.MM New submission

MMAE: A Massive Multitask Audio Editing Benchmark

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

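Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Nanyang Technological University; Hunyuan Team, Tencent; Tianjin University; Fudan University

AI summary: Introduces MMAE, the first comprehensive evaluation benchmark for general-purpose instruction-based audio editing, covering 7 audio modalities, 6 levels of task complexity, and 8 operation types. Its 2,000 samples and rubric-based evaluation framework reveal severe shortcomings of current models in precise execution and structural robustness.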

Comments Open-Source at https://github.com/ddlBoJack/MMAE

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

2606.07293 2026-06-08 cs.SD cs.LG New submission

TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

Constantin Alexander Auga

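Affiliations: Hasso Plattner Institute / University of Potsdam

AI summary: Proposes TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on continuous emotion and operates in a compact latent space, achieving high conversion accuracy and speech quality.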

Comments 5 pages, 2 figures, 2 tables, preprint

Abstract

Speech Emotion Conversion (SEC) aims to transform the emotion of a source utterance into a target emotion while preserving content and speaker identity. SEC on in-the-wild data is challenging due to the non-parallel nature of training data and complex real-world acoustics. Existing fixed-duration approaches either struggle to shift the emotion effectively (high quality, low conversion) or degrade speech naturalness (low quality, high conversion). We propose TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on speaker identity and continuous emotion. Unlike methods that diffuse over spectrograms, TargetSEC operates in a compact latent space. Experiments on the MSP-Podcast dataset show that TargetSEC outperforms current non-duration baselines in conversion accuracy while maintaining high speech quality, and achieves performance comparable to duration-prediction systems without explicit temporal modeling.

2606.07309 2026-06-08 cs.SD cs.AI cs.CL New submission

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

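Affiliations: DFG's Reinhart Koselleck project; EU H2020 project

AI summary: Studies whether audio language models use explicit acoustic cues in a grounded way. Six interpretable acoustic concept tokens are derived from eGeMAPS features; aligned tokens improve UAR while shuffled, conflicting, or corrupted tokens reduce performance, showing that the models are sensitive to symbolic cues yet remain partly anchored to the audio signal.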

Comments 6 pages, 3 figures, 3 tables

Abstract

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
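
Unweighted average recall (UAR), the metric reported above, is the mean of per-class recalls, so each emotion class counts equally regardless of how many samples it has. A minimal sketch follows; the function name and the toy labels are assumptions.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Recall for each class present in the reference labels.
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    # Plain mean over classes, ignoring class frequencies.
    return float(np.mean(recalls))

# Imbalanced toy case: a classifier that always answers "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

The always-neutral classifier scores 80% accuracy but only 0.5 UAR, which shows why UAR is informative when perturbed tokens shift confusions toward the neutral class.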

2606.07356 2026-06-08 cs.SD cs.CL New submission

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

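Affiliations: School of Computer Science and Engineering, Northeastern University, Shenyang, China; Kunming University of Science and Technology; NiuTrans Research, Shenyang, China

AI summary: Proposes DirectAudioEdit, a training-free and inversion-free text-guided audio editing method that constructs the editing path through diffusion prediction contrast. On music and event-level benchmarks it reduces FAD and KL by more than 15% and speeds up editing by up to 64.5%.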

Abstract

Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

2606.07397 2026-06-08 cs.SD 新提交

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

Audio-Oscar: 一个用于复杂音频场景生成、编排和优化的多智能体系统

Yifan Duan, Qixiang Xu, Hengtao Wu, Zhanxun Liu, Wenhao Guan, Junxi Liu, Ziyang Ma, Kelu Xu, Xie Chen

发表机构 * MoE Key Lab of Artificial Intelligence(教育部人工智能重点实验室) X-LANCE Lab(X-LANCE实验室) Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室) Xiamen University(厦门大学) State Key Laboratory of Complex & Critical Software Environment, China(复杂与关键软件环境国家重点实验室,中国)

AI总结 提出Audio-Oscar多智能体框架,通过协调多个专业智能体处理角色建模、语音生成、时间线规划等,实现复杂音频场景的生成与优化,并构建ASG-Bench基准进行评估。

AI中文摘要

近年来,音频生成在文本到语音(TTS)、文本到音频(TTA)和文本到音乐(TTM)等任务上取得了显著进展。然而,从复杂的音频场景描述中生成长格式且可控的音频仍然是一个重大挑战,因为此类场景通常需要协调语音、音效、音乐、歌曲、时间结构以及后期制作。在这项工作中,我们引入了 Audio-Oscar,一个用于从复杂描述生成音频的多智能体框架。Audio-Oscar 协调一组专业智能体,每个智能体负责音频场景的不同方面,包括角色建模和声音设计、语音生成、细粒度时间线规划、模型选择、非语音生成以及音频后期制作。Audio-Oscar 还整合了反馈驱动的优化。此外,为了解决缺乏从复杂音频场景描述评估音频生成的合适基准的问题,我们构建了 ASG-Bench,一个音频场景生成基准,包含与参考音频配对的场景描述和纯文本场景描述。每个场景都标注了目标音频事件和时间语句,以评估生成的音频是否忠实地实现了所需的场景内容和时间结构。实验结果表明,Audio-Oscar 能够有效生成与复杂场景描述匹配的音频。项目样本可在 https://audiooscar.github.io/ 获取。我们的代码可在 https://github.com/ziye26/Audio-Oscar 获取。

英文摘要

In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a multi-agent framework for generating audio from complex descriptions. Audio-Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production. Audio-Oscar further incorporates feedback-driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG-Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text-only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio-Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio-Oscar.

2606.07473 2026-06-08 cs.SD cs.AI 新提交

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper 幻觉检测与缓解:基于隐藏表示引导和稀疏自编码器

Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova

发表机构 * AI Foundation and Algorithm Lab(AI基础与算法实验室) National University of Science and Technology MISIS(国立科技大学MISIS) National Research University Higher School of Economics(国立研究大学高等经济学院)

AI总结 通过分析Whisper内部表示,提出基于稀疏自编码器的引导策略,将非语音测试集上的幻觉率从72.63%降至14.11%(small模型),接近微调方法性能。

AI中文摘要

Whisper是一种广泛采用的ASR模型,已知存在幻觉问题——即对非语音音频生成与输入完全无关的连贯转录。我们研究了是否可以通过Whisper的内部表示来检测和缓解幻觉。我们提取音频编码器激活,并评估两种表示空间:原始Whisper激活和稀疏自编码器(SAE)潜在变量。我们表明,两个空间都编码了线性可分的幻觉相关信息,判别能力集中在稀疏特征子集中,并向更深编码器层增强。我们提出了两种引导策略:激活空间引导和SAE潜在空间引导。基于SAE的引导将完整非语音测试集上的幻觉率从72.63%降至14.11%(Whisper small),从86.88%降至27.33%(Whisper large-v3),同时在语音数据上WER退化很小,接近基于微调方法的性能。

英文摘要

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
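
The abstract above does not spell out how the steering vectors are built. As a generic illustration of activation-space steering only, the sketch below uses a difference-of-means direction between two groups of activations and shifts a new activation along it. The function names, the toy two-dimensional activations and the choice of a difference-of-means direction are all assumptions for illustration, not the authors' method.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(halluc_acts, clean_acts):
    """Difference of group means (assumed choice of steering direction)."""
    mu_h = mean_vector(halluc_acts)
    mu_c = mean_vector(clean_acts)
    return [h - c for h, c in zip(mu_h, mu_c)]


def steer(activation, direction, alpha):
    """Shift an activation by alpha along the direction.

    A negative alpha moves the activation away from the 'hallucination' mean.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical values, two features per frame).
halluc_acts = [[2.0, 0.0], [4.0, 0.0]]
clean_acts = [[0.0, 0.0], [2.0, 0.0]]
direction = steering_direction(halluc_acts, clean_acts)
steered = steer([3.0, 1.0], direction, alpha=-1.0)
```

In a real system the same shift would be applied to encoder activations (or to SAE latents before decoding them back), and `alpha` would be tuned against the WER on speech data.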

2606.07494 2026-06-08 cs.SD eess.AS 新提交

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

缓解深度伪造语音中的代理到真实域差距

Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

发表机构 * Graduate Institute of Communication Engineering, National Taiwan University(国立台湾大学通信工程研究所) Graduate Institute of Networking and Multimedia, National Taiwan University(国立台湾大学网络与多媒体研究所) Department of Information Management, National Taiwan University(国立台湾大学信息管理系) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心卓越计划)

AI总结 提出域偏移特征增强(DSFA)方法,通过将确定性特征统计转换为随机分布来缩小代理数据与真实世界之间的域差距,在CoSG ExtEval数据集上达到最先进性能。

Comments Work in progress

AI中文摘要

最近的基于神经音频编解码器的语音生成(CodecFake)产生了高度逼真的音频,对现有的深度伪造反制模型构成了挑战。虽然使用编解码器重合成语音(CoRS)作为代理数据可以提高性能,但它通常泛化能力有限。我们提出了域偏移特征增强(DSFA),通过在微调期间将确定性特征统计转换为随机分布来模拟“真实世界”的变化。为了评估泛化能力,我们进一步引入了基于编解码器的语音生成扩展评估(CoSG ExtEval)数据集,这是CoSG Eval(来自CodecFake+)数据集的更具挑战性的扩展,包含40个未见过的生成模型和长音频。实验结果表明,将后训练的SSL骨干与DSFA相结合有效地缩小了代理到真实世界的域差距。该方法在CoSG Eval和CoSG ExtEval中针对各种CodecFake攻击均达到了最先进的性能。

英文摘要

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
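
One plausible reading of "transforming deterministic feature statistics into stochastic distributions" is to re-normalise each feature channel with a randomly perturbed mean and standard deviation during fine-tuning. The sketch below shows that idea on a single channel. The Gaussian perturbation, the parameter names `sigma_mean` and `sigma_std`, and the helper name are assumptions; the abstract does not state that DSFA takes this exact form.

```python
import random
import statistics


def perturb_feature_stats(feature, sigma_mean, sigma_std, rng):
    """Re-normalise one feature channel with randomly perturbed mean and std."""
    mu = statistics.fmean(feature)
    sd = statistics.pstdev(feature)
    if sd == 0.0:
        return list(feature)  # constant channel: nothing to rescale
    # Treat the statistics as random variables centred on their observed values.
    new_mu = mu + rng.gauss(0.0, sigma_mean)
    new_sd = abs(sd + rng.gauss(0.0, sigma_std))
    # Standardise with the observed statistics, then restore with the sampled ones.
    return [(v - mu) / sd * new_sd + new_mu for v in feature]


rng = random.Random(0)  # seeded only to make the sketch repeatable
channel = [1.0, 2.0, 3.0, 4.0]
augmented = perturb_feature_stats(channel, sigma_mean=0.5, sigma_std=0.1, rng=rng)
```

With both sigmas set to zero the function returns the input unchanged (up to rounding), which is the natural setting at evaluation time.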

2606.06795 2026-06-08 eess.AS cs.SD 交叉投稿

BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation

BiEAR: 一种受人类听觉启发的自适应双耳前端,用于多说话人定位和距离估计

Hanyu Meng, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Qiquan Zhang, Haizhou Li

发表机构 * The University of New South Wales(新南威尔士大学) Tongyi Speech Lab, Alibaba Group(通义语音实验室,阿里巴巴集团) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(人工智能学院,香港中文大学(深圳))

AI总结 提出受人类听觉启发的自适应双耳前端BiEAR,通过神经控制器动态调整滤波器组频率选择性,提升多说话人定位和距离估计的准确性与鲁棒性。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

我们提出BiEAR,一种受人类听觉启发的自适应双耳前端,用于多说话人定位和距离估计。受人类听觉中内侧橄榄耳蜗(MOC)反馈的启发,BiEAR使用神经控制器在推理过程中自适应调整双耳听觉滤波器组的频率选择性。这为双耳产生时频自适应表示,使模型能够响应变化的声学条件。我们在消声和真实房间环境中评估了BiEAR在多说话人定位和距离估计上的性能。结果表明,与常用的固定双耳前端相比,自适应前端提高了定位准确性以及对未见说话人和房间的鲁棒性。对学习到的滤波器自适应的可视化和分析表明,BiEAR随时间强调信息丰富的频带。这些发现表明,自适应的、受生物启发的双耳前端可以改善机器在复杂声学场景中的听觉鲁棒性。

英文摘要

We present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for both ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.

2606.06907 2026-06-08 eess.AS cs.AI cs.SD 交叉投稿

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

SpectCount: 通过合成信号进行频谱时间计数改进大型音频语言模型

Seonuk Kim, Yonghyeon Jun, Ju Yeon Kang, Jimin Hong, Yoonhyeong Lee, Nam Soo Kim

发表机构 * Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea(电气与计算机工程系和INMC,首尔国立大学,首尔,韩国)

AI总结 针对大型音频语言模型在频谱时间感知上的弱点,提出SpectCount方法,利用动态生成的完全合成音频信号进行数据高效微调,无需真实音频或标注,显著提升多种听觉基准性能。

Comments 5 pages, 5 figures

AI中文摘要

大型音频语言模型(LALMs)通过音频编码器和大规模音频数据扩展了大型语言模型。然而,高质量标注音频数据的稀缺性仍然是扩展的根本瓶颈。通过探测信号可检测性分析,我们识别出基础LALM在细粒度频谱时间感知上的弱点。为了解决这些挑战,我们提出频谱时间计数(SpectCount),一种基于动态生成的完全合成音频信号的数据高效微调方法,无需依赖真实世界音频、标注或预训练生成模型。SpectCount不仅解决了观察到的弱点,还在微调期间未见的声音、音乐和语音等多种听觉基准上提升了性能。这些结果表明,针对弱点的合成信号为LALMs增强听觉理解能力提供了一条数据高效的途径。

英文摘要

Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.

2606.07240 2026-06-08 cs.CL cs.SD 交叉投稿

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

KIT 提交至 IWSLT 2026 跨语言语音克隆任务

Seymanur Akti, Alexander Waibel

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院) Carnegie Mellon University (CMU)(卡内基梅隆大学) KIT Campus Transfer (KCT)(KIT校区转移)

AI总结 针对跨语言语音克隆中的口音变化和领域词汇问题,基于FishAudio-S2-Pro多语言文本转语音模型,引入语言标签提示、强化学习微调和参考条件词汇匹配方法,提升可懂度和自然度。

AI中文摘要

跨语言语音克隆旨在在保留源语言参考说话者身份的同时,生成目标语言的语音。该任务是语音翻译的核心,也是IWSLT 2026跨语言语音克隆轨道的焦点。一个关键挑战是在口音变化和领域特定词汇存在的情况下保持可懂度和自然度。我们基于多语言文本转语音模型FishAudio-S2-Pro,引入语言标签提示以改善语言控制并减少口音泄漏。我们进一步应用强化学习(RL)微调进行任务适应,并观察到可懂度的提升。最后,我们提出了一种参考条件词汇匹配方法,在词汇重叠时改善领域特定术语的发音。结果表明,语言提示带来了最大的增益,而词汇匹配在匹配子集上产生了一致的改进。

英文摘要

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

2606.07259 2026-06-08 eess.AS cs.SD 交叉投稿

Assessing True Generalisability of Audio-Visual Speech Recognisers

评估音视频语音识别器的真正泛化能力

Zhaofeng Lin, Stavros Petridis, Maja Pantic, Naomi Harte

发表机构 * Trinity College Dublin(都柏林三一学院) Imperial College London(伦敦帝国理工学院)

AI总结 通过构建与LRS3测试集严格匹配的评估集,发现当前最先进的音视频语音识别模型在未见数据上性能全面崩溃,揭示了其泛化能力不足,并分析了退化原因、词汇偏差和错误模式。

Comments Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures

AI中文摘要

当前的音视频语音识别(AVSR)模型在标准LRS3基准上实现了近乎完美的性能,引发了对自适应过拟合的担忧。为了系统评估真正的泛化能力,我们从大规模MultiVSR数据集中构建了一个高度可控、未见过的评估子集。与标准的分布外基准不同,我们的子集在声学、视觉和人口统计分布上与LRS3测试集严格匹配。评估五种最先进的架构揭示了普遍的性能崩溃,证明当前系统即使在严格对齐的条件下也无法泛化。通过跨七个因素的细粒度属性分析,我们隔离了这种退化的具体驱动因素。此外,我们发现了深刻的词汇偏差,揭示了不同的错误模式,并令人惊讶地发现音视频性能甚至落后于纯音频设置。我们发布了匹配的测试集,用于未来的基准测试。

英文摘要

Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.

2606.07271 2026-06-08 cs.LG cs.AI cs.SD 交叉投稿

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

整流流泄漏之处:沿插值路径表征成员信号

Thomas Sesmat, Gabriel Meseguer-Brocal, Geoffroy Peeters

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文分析整流流(Rectified Flows)在插值路径上的训练数据成员信号,发现训练与测试数据的重建差异呈钟形曲线,并在高斯假设下推导出峰值位置,验证了该结构的普适性,并利用其进行成员推断攻击。

Comments ICML 2026 article, 9 main pages and 25 with annexes, 11 figures

Journal ref
43rd International Conference on Machine Learning, Seoul, South Korea, 2026
AI中文摘要

理解生成模型从训练数据中保留了什么仍然具有挑战性,这对版权和隐私有影响。除了逐字复制外,模型可以编码训练数据中更微妙的痕迹,这些痕迹从未出现在输出中,但仍可利用。我们针对整流流(Rectified Flows)研究了这一机制,整流流越来越多地用于部署的生成系统。我们分析了定义整流流训练的插值路径 $X_\lambda = (1-\lambda)X_0 + \lambda X_1$。我们展示了训练数据和测试数据的重建之间存在一个差距,该差距在 $\lambda$ 上呈钟形曲线,并在训练过程中累积,而验证指标保持稳定。该信号有一个最大值,我们在高斯假设下推导出其位置的闭式解。我们在音频和图像上验证了这些预测,并表明钟形结构是普遍的,而峰值预测在我们的假设满足时成立。作为概念验证,我们利用这种特定的 $\lambda$ 解析结构进行成员推断攻击,区分训练集的成员和非成员。

英文摘要

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data; it follows a bell-shaped curve over $\lambda$ and accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $\lambda$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
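
To make the interpolation path in the abstract above concrete, the sketch below evaluates a point on the straight line between two endpoints and the path's derivative with respect to the interpolation parameter, which is the constant velocity that rectified-flow training usually regresses. The function names and the toy endpoints are illustrative assumptions; which endpoint plays the role of noise and which of data is a convention the abstract does not fix.

```python
def interpolate(x0, x1, lam):
    """Point on the straight path X_lam = (1 - lam) * X_0 + lam * X_1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Derivative of the path with respect to lam; constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


x0 = [0.0, 2.0]   # hypothetical endpoint at lam = 0
x1 = [4.0, -2.0]  # hypothetical endpoint at lam = 1
midpoint = interpolate(x0, x1, 0.5)
velocity = velocity_target(x0, x1)
```

A membership signal resolved over the interpolation parameter, as described in the abstract, would be obtained by comparing a model's reconstruction error at several values of `lam` for training and held-out samples.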

2603.08683 2026-06-08 cs.SD cs.AI cs.LG eess.AS 版本更新

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

全保真音频无损压缩的语言建模基准测试

Phillip Long, Zachary Novack, Chris Donahue

发表机构 * University of California, San Diego, Computer Science and Engineering Department(加州大学圣地亚哥分校计算机科学与工程系) Carnegie Mellon University, School of Computer Science(卡内基梅隆大学计算机科学学院)

AI总结 提出字节级分词方案Trilobyte,将词汇量从指数级降至常数级,首次实现24位音频的LM无损压缩,并在8位和16位下超越FLAC。

Comments Accepted at Interspeech 2026, 7 pages, 5 figures

AI中文摘要

在原始波形上训练的自回归“语言”模型(LM)可被重新用于无损音频压缩,但先前的工作仅限于8位音频,尚不清楚此类方法是否适用于实际场景(16/24位)以及能否与现有编解码器竞争。我们对基于LM的压缩在全保真音频上进行了基准测试,涵盖不同领域(音乐、语音、生物声学)、采样率(16kHz-48kHz)和位深度(8、16、24位)。标准的样本级分词在更高位深度下因词汇量过大(16位为65K;24位为16.7M)而变得不可行。我们提出了Trilobyte,一种用于全分辨率音频的字节级分词方案,将词汇量从$O(2^{b})$改进为$O(1)$,并首次实现了可行的24位基于LM的无损压缩。虽然LM在8位和16位下持续优于FLAC并达到最先进的压缩效果,但我们观察到,随着位深度超过8位,压缩增益变得更为有限。

英文摘要

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
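
The vocabulary argument can be illustrated directly: a sample treated as a single token needs one symbol per possible value, whereas splitting each sample into bytes keeps the vocabulary at 256 symbols regardless of bit depth, at the cost of a longer token sequence. The sketch below assumes unsigned integer samples and big-endian byte order, and the helper names are hypothetical; Trilobyte's actual scheme may differ in these details.

```python
def sample_to_bytes(sample, bit_depth):
    """Split one unsigned sample into bit_depth // 8 byte tokens (big-endian)."""
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8")
    if not 0 <= sample < (1 << bit_depth):
        raise ValueError("sample out of range for this bit depth")
    return list(sample.to_bytes(bit_depth // 8, "big"))


def bytes_to_sample(tokens):
    """Inverse of sample_to_bytes: reassemble the byte tokens into one sample."""
    return int.from_bytes(bytes(tokens), "big")


def tokenize(samples, bit_depth):
    """Flatten a sequence of samples into a stream of byte tokens."""
    tokens = []
    for sample in samples:
        tokens.extend(sample_to_bytes(sample, bit_depth))
    return tokens


# Vocabulary size per scheme: one token per sample value vs one token per byte.
sample_level_vocab = {b: 2 ** b for b in (8, 16, 24)}
byte_level_vocab = 256
```

The round trip through `bytes_to_sample` is exact, which is what a lossless codec requires of its tokenization.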

2603.26394 2026-06-08 cs.SD 版本更新

CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding

CA-TCN: 一种用于直接听觉注意解码的因果-反因果时序卷积网络

Iñigo García-Ugarte, Rubén Eguinoa, Ricardo San Martín, Daniel Paternain, Carmen Vidaurre

发表机构 * Department of Science, Universidad Pública de Navarra (UPNA)(纳瓦拉公共大学科学系) BCBL, Basque Center on Cognition Brain and Language(巴斯克认知脑语言中心) Department of Statistics, Computer Sciences and Mathematics, Universidad Pública de Navarra (UPNA)(纳瓦拉公共大学统计、计算机科学与数学系) Ikerbasque, Basque Foundation for Science(巴斯克科学基金会)

AI总结 提出CA-TCN,一种因果-反因果时序卷积网络,直接对注意说话者进行分类,通过分别采用因果和反因果卷积对齐听觉刺激与神经响应,在多个数据集上比AADNet提升0.5%-3.2%的解码准确率。

Comments 10+2(refs) pages, 5 figures, 4 Tables, IEEE transactions preprint

AI中文摘要

在复杂听觉环境中引导听觉注意的一种有前景的方法依赖于听觉注意解码(AAD),其旨在从神经记录中识别多说话者场景中注意的语音流。基于夹带的AAD方法通常假设可以访问干净的语音源和脑电图(EEG)信号,以利用神经响应与注意刺激之间的低频相关性。在本研究中,我们提出了CA-TCN,一种因果-反因果时序卷积网络,直接对注意说话者进行分类。所提出的架构整合了卷积神经网络在序列处理任务中的若干最佳实践。重要的是,它通过分别采用具有不同感受野、在相反时间方向上操作的因果和反因果卷积,显式地对齐听觉刺激和神经响应。通过与三个基线AAD模型比较获得的实验结果表明,CA-TCN在数据集和决策窗口上一致地提高了解码精度,与次优模型AADNet相比,在被试无关模型中增益范围为0.5%至3.2%,在被试特定模型中增益范围为0.8%至2.9%。此外,在比较最小期望切换持续时间分布时,这些改进在六个评估设置中的四个中具有统计显著性。除了准确性,该模型在不同条件下表现出空间鲁棒性,因为EEG空间滤波器在数据集上表现出稳定的模式。总体而言,这项工作引入了一个准确且统一的AAD模型,其性能优于现有方法,同时考虑了在线处理场景的实际优势。这些发现有助于推进AAD的发展及其在现实世界系统中的适用性。

英文摘要

A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aims to identify the attended speech stream in a multi-speaker scenario from neural recordings. Entrainment-based AAD approaches typically assume access to clean speech sources and electroencephalography (EEG) signals to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions, respectively, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while considering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.
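
The difference between the two convolution types named in the abstract can be shown on a single impulse: a causal filter only looks at the current and past samples, so its response spreads forward in time, while an anticausal filter only looks at the current and future samples, so its response spreads backward. The sketch below is a plain one-dimensional illustration with zero padding; the function names and kernel are hypothetical and it says nothing about CA-TCN's layer sizes, dilations or receptive fields.

```python
def causal_conv(x, kernel):
    """y[t] = sum_i kernel[i] * x[t - i]: only current and past samples."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            if t - i >= 0:  # samples before the start count as zero
                acc += w * x[t - i]
        y.append(acc)
    return y


def anticausal_conv(x, kernel):
    """y[t] = sum_i kernel[i] * x[t + i]: only current and future samples."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            if t + i < len(x):  # samples past the end count as zero
                acc += w * x[t + i]
        y.append(acc)
    return y


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [1.0, 0.5]
forward_response = causal_conv(impulse, kernel)       # spreads to later samples
backward_response = anticausal_conv(impulse, kernel)  # spreads to earlier samples
```

Because a neural response lags the stimulus that evoked it, filtering the two streams in opposite temporal directions is one way to bring them onto a common time axis before comparison.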

2606.01802 2026-06-08 cs.SD cs.AI 版本更新

MOSS-Audio Technical Report

MOSS-Audio 技术报告

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Donghua Yu, Jun Zhan, Kang Yu, Kexin Huang, Liwei Fan, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Xingjian Zhao, Yang Gao, Yitian Gong, Yiyang Zhang, Zhe Xu, Xipeng Qiu

发表机构 * OpenMOSS Team(开放MOSS团队)

AI总结 提出统一音频-语言模型 MOSS-Audio,通过 DeepStack 跨层特征注入和时间标记实现语音、环境声和音乐的理解,在音频字幕、时间感知问答、时间戳转录和音频推理任务上取得强性能。

AI中文摘要

MOSS-Audio 是一个统一的音频-语言模型,用于语音、环境声和音乐的理解,支持音频字幕、时间感知问答、时间戳转录和基于音频的推理。MOSS-Audio 将专用音频编码器与模态适配器和大语言模型耦合:编码器产生 12.5 Hz 的时间表示,适配器将其投影到解码器空间,解码器生成自回归文本输出。两个设计选择是系统的核心:DeepStack 跨层特征注入,将来自多个编码器深度的声学信息暴露给解码器;以及时间标记,通过在音频标记流中插入时间戳标记来提供显式的时间线索。在数据层面,我们设计了一个事件保留的音频标注流程,将原始音频在连贯的事件边界处分割,对语音、音乐和通用音频应用分支特定标注,并将结果合并为统一的字幕用于预训练。中间的分支特定字幕进一步保留以支持面向任务的 SFT 数据的构建。该模型在大规模音频-语言数据上进行预训练,结合时间感知目标以支持时间定位,然后进行多阶段后训练以增强指令遵循和基于音频的推理。我们发布了 4B 和 8B 两种变体,包括 Instruct 和 Thinking 配置。MOSS-Audio 在通用音频理解、语音字幕、ASR 和时间戳 ASR 上取得了强性能,使其成为未来语音代理的有前途的理解基础。

英文摘要

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
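
At 12.5 Hz each encoder frame covers 80 ms, so a timestamp can be attached to any position in the audio-token stream by dividing the frame index by the frame rate. The sketch below interleaves marker tokens at a fixed interval to illustrate the idea of time markers. The marker format `<t=...s>`, the 2-second interval and the function name are assumptions for illustration; the report's actual marker vocabulary and spacing are not given in the abstract.

```python
FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract


def insert_time_markers(audio_tokens, interval_s=2.0, frame_rate_hz=FRAME_RATE_HZ):
    """Interleave timestamp markers into a sequence of audio tokens."""
    frames_per_marker = int(round(interval_s * frame_rate_hz))
    if frames_per_marker < 1:
        raise ValueError("interval_s is shorter than one frame")
    out = []
    for i, token in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Frame index divided by frame rate gives the elapsed time in seconds.
            out.append(f"<t={i / frame_rate_hz:.1f}s>")
        out.append(token)
    return out


four_seconds = list(range(50))  # 50 frames at 12.5 Hz = 4 s of audio
marked = insert_time_markers(four_seconds)
```

With markers in the stream, a decoder can ground phrases such as "at around two seconds" in explicit tokens rather than in position alone.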

2512.00883 2026-06-08 cs.MM cs.CV cs.SD 版本更新

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

视听世界模型:为具身智能体奠定多感官想象的基础

Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

发表机构 * Tsinghua University(清华大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出视听世界模型(AVWM)统一框架,通过条件扩散Transformer(AV-CDiT)联合预测双耳音频与视觉动态,在30小时基准AVW-4k上实现高保真多模态预测,并验证其在具身导航中的有效性。

AI中文摘要

世界模型通过模拟环境动态使智能体能够规划和推理未来状态。虽然现有方法主要关注视觉观察,但现实世界的感知本质上涉及多种感觉模态。音频提供了关键的空间和时间线索,如声源定位和声学场景属性,但其整合到世界模型中仍相对未被充分探索。先前的工作尚未建立低层动作控制下视听世界建模的通用公式,也未阐明如何联合捕捉物理上合理的双耳音频和视觉动态。本文提出了视听世界模型(AVWM)的统一公式,将多模态环境模拟建模为具有同步视听观测的部分可观测马尔可夫决策过程。作为解决该问题的基础步骤,我们构建了AVW-4k,一个受控基准数据集,包含30小时的双耳视听轨迹,覆盖76个室内环境并带有动作标注。我们提出了AV-CDiT,一种视听条件扩散Transformer,采用新颖的模态专家架构平衡视觉和听觉学习,通过三阶段训练策略优化以实现有效的多模态整合。在该基准上的大量实验表明,AV-CDiT在视觉和听觉模态上实现了高保真多模态预测。此外,我们验证了其在具身导航中的实际效用,证明AVWM改进了视觉-语言模型引导的智能体在连续视听导航中的表现。

英文摘要

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.

2606.05763 2026-06-08 eess.AS cs.SD 版本更新

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

M2S-AVSR:面向鲁棒视听语音识别的模态感知多视角自监督表示

Fei Su, Cancan Li, Ming Li, Juan Liu

发表机构 * School of Artificial Intelligence and the School of Computer Science, Wuhan University, China(人工智能学院和计算机科学学院,武汉大学,中国) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China(人工智能学院,香港中文大学(深圳),中国) School of Artificial Intelligence, Wuhan University, China(人工智能学院,武汉大学,中国)

AI总结 提出一种模态感知多视角自监督表示框架(M2S-AVSR),通过多视角编码学习视角不变视觉语音表示,并利用模态感知模块进行细粒度融合,以应对视角变化、音频失真和视觉遮挡等挑战,在多个基准上取得最优性能。

Comments submitted to IEEE Transactions on Audio, Speech, and Language Processing

AI中文摘要

视听语音识别(AVSR)通过利用视觉线索增强语音识别的鲁棒性,而现实场景由于视角变化、音频失真和视觉遮挡而仍然具有挑战性,这些因素会降低模态质量并增加视听异步性。在本文中,我们提出了一种新颖的模态感知多视角自监督表示框架,用于鲁棒的视听语音识别(M2S-AVSR)。首先,我们引入了一个多视角表示学习编码器,以学习视角不变的视觉语音表示。其次,我们采用了一个模态感知模块,该模块显式地对模态质量和跨模态同步性进行建模,以执行细粒度的模态感知融合,从而在解码过程中实现细粒度的视觉信息注入。此外,我们提出了AISHELL8-RealScene,一个在真实环境中录制的公开多场景、多视角对话视听数据集,并在此基础上建立了语音识别基准。在英语和普通话基准上的实验证明了所提出方法在挑战性条件下的有效性。在LRS3上,M2S-AVSR在视角扰动和视觉退化设置下实现了高达29.4%的相对改进。我们的方法还在MISP2021-AVSR测试集上取得了新的最先进性能。在AISHELL8-RealScene上,它在户外场景中取得了最佳结果。所提出的方法和数据集为未来在现实条件下进行鲁棒语音和多模态任务的研究提供了有用的支持。

英文摘要

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

2507.17799 2026-06-08 eess.AS cs.LG cs.SD 版本更新

A Concept-based approach to Voice Disorder Detection

基于概念的嗓音障碍检测方法

Davide Ghia, Gabriele Ciravegna, Alkis Koudounas, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli

发表机构 * Politecnico di Torino(都灵理工大学) CENTAI Institute(CENTAI研究院) San Feliciano Hospital(San Feliciano医院) SCDU Otorinolaringoiatria, Head Neck Cancer Unit, Ospedale San Giovanni Bosco(SCDU耳鼻喉科,头颈癌症单元,San Giovanni Bosco医院) Dipartimento di Oncologia, Università degli Studi di Torino(肿瘤学系,都灵大学)

AI总结 本文提出基于概念的嗓音障碍检测方法,利用可解释AI提升模型透明度,与传统深度学习方法相比,实现更清晰的决策框架。

AI中文摘要

嗓音障碍影响了大量人口,使用自动化非侵入性技术进行诊断将显著推动医疗进步,提高患者生活质量。近期研究表明,人工智能模型,特别是深度神经网络(DNNs),能有效解决此任务。然而,由于其复杂性,此类模型的决策过程常不透明,限制了其在临床中的可信度。本文探讨了基于可解释AI(XAI)的替代方法,旨在通过提供不同形式的解释来提高DNNs的可解释性。具体而言,本文聚焦于概念模型,如概念瓶颈模型(CBM)和概念嵌入模型(CEM),探讨它们如何在性能上与传统深度学习方法相媲美,同时提供更透明和可解释的决策框架。

英文摘要

Voice disorders affect a significant portion of the population, and the ability to diagnose them using automated, non-invasive techniques would represent a substantial advancement in healthcare, improving the quality of life of patients. Recent studies have demonstrated that artificial intelligence models, particularly Deep Neural Networks (DNNs), can effectively address this task. However, due to their complexity, the decision-making process of such models often remains opaque, limiting their trustworthiness in clinical contexts. This paper investigates an alternative approach based on Explainable AI (XAI), a field that aims to improve the interpretability of DNNs by providing different forms of explanations. Specifically, this work focuses on concept-based models such as the Concept Bottleneck Model (CBM) and the Concept Embedding Model (CEM), and on how they can achieve performance comparable to traditional deep learning methods while offering a more transparent and interpretable decision framework.