arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31521 2026-06-01 cs.CL cs.SD Updated version

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Basic Model Technology Center, WeChat AI, Tencent Inc.

AI summary: Proposes the UniAudio-Token framework, which uses Semantic-Acoustic Primitives (SAP) and a Semantic-Acoustic Equilibrium (SAE) mechanism to add general audio perception to semantic tokenizers without sacrificing speech ability, yielding a unified audio interface.

Comments 19 pages, 10 figures

Abstract

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
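
The content-aware gating idea mentioned for SAE can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the function name `gated_fusion`, the choice of a sigmoid gate computed from the deep features, the residual form of the fusion, and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(deep, shallow, w_gate, b_gate):
    """Blend shallow-layer features into deep-layer features with a learned gate.

    deep, shallow: arrays of shape (frames, dim)
    w_gate: (dim, dim) hypothetical gate weights; b_gate: (dim,) hypothetical bias
    """
    # The gate is a function of the deep (content) features, so it is "content-aware".
    gate = sigmoid(deep @ w_gate + b_gate)
    # Each gate value lies in (0, 1) and scales how much shallow detail is added back.
    return deep + gate * shallow

rng = np.random.default_rng(0)
deep = rng.normal(size=(4, 8))       # stand-in for deep, semantic features
shallow = rng.normal(size=(4, 8))    # stand-in for shallow, acoustic features
w_gate = rng.normal(size=(8, 8)) * 0.1
b_gate = np.zeros(8)
fused = gated_fusion(deep, shallow, w_gate, b_gate)
```

With a strongly negative bias the gate closes and the output falls back to the deep features alone, which is one simple way to see how such a mechanism could leave speech-centric behaviour untouched when acoustic detail is not needed.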

2605.31469 2026-06-01 cs.CL cs.AI cs.SD eess.AS Updated version

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

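Affiliations: Department of Telecommunications and Artificial Intelligence; Budapest University of Technology and Economics; Speechtex Ltd.; ELTE Research Centre for Linguistics

AI summary: To address the shortage of training data for Hungarian conversational speech recognition, the paper expands the BEA-Dialogue corpus to 200 hours by relaxing the split criterion, evaluates Whisper- and FastConformer-based models, and shows that fine-tuning based on Serialized Output Training consistently improves recognition performance.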

Abstract

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.
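
For readers unfamiliar with the metrics named above, the following is a minimal sketch of WER and of cpWER (concatenated minimum-permutation WER). It is not the paper's scoring code. It assumes whitespace tokenization, equal numbers of reference and hypothesis speakers, and per-speaker transcripts that are already concatenated; the function names are illustrative.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Try every speaker assignment and keep the one with the fewest errors."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return best_errors / total_ref_words

plain = wer("a b c d", "a x c")
swapped = cp_wer(["hello there", "good morning"], ["good morning", "hello there"])
one_error = cp_wer(["a b", "c d"], ["c d", "a x"])
```

CER and cpCER follow the same pattern with characters in place of words, for example by passing `list(text)` instead of `text.split()`. The permutation search is factorial in the number of speakers, which is acceptable for two-party dialogue but not for large meetings.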

2605.31432 2026-06-01 cs.CL cs.AI cs.SD Updated version

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

Sara Papi, Luisa Bentivogli

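Affiliations: Fondazione Bruno Kessler

AI summary: Proposes the DOA policy, which derives a proxy alignment from decoder self-attention so that SpeechLLMs can make streaming decisions in long-form simultaneous translation without training.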

Abstract

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
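
To make the notion of an attention-derived read/write policy concrete, here is a generic sketch. It is not the DOA algorithm, whose details the abstract does not give. It assumes each candidate output token comes with a vector of attention weights over the audio frames received so far, and it uses a hypothetical `guard_frames` threshold: a token whose strongest attention falls on the newest frames is held back until more audio arrives.

```python
import numpy as np

def tokens_to_emit(attn, guard_frames):
    """Return how many leading candidate tokens are considered safe to write.

    attn: array of shape (num_candidate_tokens, num_audio_frames), attention
          from each candidate token to the audio frames received so far.
    guard_frames: hypothetical number of most recent frames treated as unstable.
    """
    num_frames = attn.shape[1]
    for i, row in enumerate(attn):
        aligned_frame = int(np.argmax(row))
        if aligned_frame >= num_frames - guard_frames:
            return i  # this token leans on the newest audio: stop writing, read more
    return attn.shape[0]

attn = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.20, 0.70],
])
emit_small_guard = tokens_to_emit(attn, guard_frames=1)
emit_large_guard = tokens_to_emit(attn, guard_frames=3)
```

A larger guard region makes the policy more conservative: fewer tokens are written per step, which raises latency but reduces the chance of committing to a translation before the relevant speech has been heard.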

2605.31295 2026-06-01 cs.SD cs.AI cs.IR cs.LG Updated version

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

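Affiliations: Athens University of Economics; Innovation Lab, Orfium, Athens, Greece; Department of Music Technology Acoustics, Hellenic Mediterranean University, Rethymno, Greece; Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Athens, Greece; Department of Informatics, Athens University of Economics

AI summary: The paper uses the Difference-in-Means method to isolate latent directions for pitch and duration in the residual stream of the Multitrack Music Transformer, and applies Gram-Schmidt orthogonalization for dual-attribute steering, enabling interpretable, deterministic attribute modulation at inference time.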

Comments Accepted at EUSIPCO 2026 (34th European Signal Processing Conference), 5 pages, 2 figures

Abstract

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
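
The two geometric steps named in the abstract, a difference-in-means direction and a Gram-Schmidt projection, can be sketched in a few lines of NumPy. The activations below are synthetic stand-ins rather than MMT residual-stream data, and the function names, steering strengths, and the way the two attributes are entangled are assumptions made for illustration.

```python
import numpy as np

def diff_mean_direction(acts_high, acts_low):
    """Unit vector pointing from the mean 'low' activation to the mean 'high' one."""
    d = acts_high.mean(axis=0) - acts_low.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its component along unit vector u."""
    w = v - np.dot(v, u) * u
    return w / np.linalg.norm(w)

def steer(hidden, direction, alpha):
    """Add a scaled direction to a hidden state."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 16
pitch_axis = rng.normal(size=dim)
duration_axis = 0.6 * pitch_axis + rng.normal(size=dim)  # deliberately entangled
base = rng.normal(size=(200, dim))

pitch_dir = diff_mean_direction(base + pitch_axis, base - pitch_axis)
duration_dir = diff_mean_direction(base + duration_axis, base - duration_axis)
duration_orth = orthogonalize(duration_dir, pitch_dir)

hidden = rng.normal(size=dim)
steered = steer(steer(hidden, pitch_dir, 2.0), duration_orth, 1.5)
```

After orthogonalization, moving along `duration_orth` leaves the projection onto `pitch_dir` unchanged, which is the sense in which the second edit no longer interferes with the first. Naive addition of the two raw directions would shift the pitch projection as a side effect.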

2605.31173 2026-06-01 cs.SD cs.AI Updated version

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Guangyin Bao, Taiping Zeng, Jianfeng Feng, Xiangyang Xue

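Affiliations: Fudan University

AI summary: Proposes the MindVoice framework, which decouples semantic and acoustic pathways and fuses pretrained generative models with voice cloning to reconstruct intelligible speech from EEG/MEG signals, substantially outperforming existing methods.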

Abstract

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

2605.31082 2026-06-01 cs.SD cs.MM Updated version

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

Nelly Garcia, Joshua Reiss

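Affiliations: Centre for Digital Music (C4DM), Queen Mary University of London

AI summary: Compares the believability of procedurally generated synthetic sound effects with real recordings in live-action and animated scenes, finding that synthetic effects work well in drama and sci-fi scenes but are less believable for everyday actions in cartoons.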

Comments ArtsIT, Interactivity and Game Creation 2024

Abstract

Creating sound for storytelling is crucial to establishing the environment in productions such as films, TV series and video games. This process often involves repeating, layering and recording real objects or using sound libraries, which can be time-consuming and repetitive. To address these challenges, procedural audio, also known as digital foley, offers a solution by allowing sound designers to quickly generate samples. Despite its efficiency, questions remain about the believability of synthetic samples compared to real ones. In our study, we compared synthetic samples generated by an online procedural engine and integrated them with both animated and live-action visuals. Our results indicate that procedural audio is highly effective and perceived as believable in drama and sci-fi scenes, particularly for sound models such as lasers, hits, air and rockets, whereas synthetic sounds weren't as believable in cartoon productions when representing everyday actions. Finally, we identified specific models that needed optimisation and highlighted audio features that needed improvement with feedback from audio professionals.

2605.31053 2026-06-01 cs.SD cs.AI Updated version

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai, Kuan-Lin Chen, Yi-Hsuan Yang, Jian-Jiun Ding

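Affiliations: National Taiwan University

AI summary: Proposes the AnchorSteer framework, which resolves semantic-structural entanglement through structural anchoring and self-discovered semantic injection, achieving significant semantic transformations with high-fidelity structural preservation.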

Comments Accepted by the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract

Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.

2605.30940 2026-06-01 eess.AS cs.MM cs.SD Updated version

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Ke Lei, Yu Zhang, Changhao Pan, Xueyi Pu, Wenxiang Guo, Ruiqi Li, Zhou Zhao

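Affiliations: Zhejiang University, China

AI summary: Proposes SwanSphere, a unified streaming framework that combines a causal autoregressive diffusion transformer, spatial video-audio contrastive learning, and multi-objective online direct preference optimization to generate high-fidelity spatial audio from panoramic videos and text prompts, and develops an automated annotation pipeline to ease data scarcity.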

Comments Accepted by ICML 2026

Abstract

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

2605.30899 2026-06-01 eess.AS cs.AI cs.SD Updated version

A Unified and Reproducible Experimentation Framework for Speech Understanding

Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li, Yi Yang, Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu

Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing; AISpeech Ltd; ETH Zürich; Nanjing University; Hangzhou Dianzi University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the SURE framework, which improves the comparability and reproducibility of speech understanding models in deployment scenarios by standardizing prediction formats, normalization, and scoring, and by adding an agent-assisted training conversion flow.

Comments This paper is submitted to INTERSPEECH 2026

Abstract

Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.

2605.30818 2026-06-01 cs.ET cs.AI cs.SD Updated version

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

Zhiwei Chen, Yijie Li, Yimo Zhang, Shiyun Shao, Yichao Chen, Dian Ding, Liang Wang, Haiwei Wu, Liwei Guo, Jie Yang, Xiaosong Zhang, Yongzhao Zhang

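Affiliations: National University of Singapore; Shanghai Jiao Tong University; Northwestern Polytechnical University

AI summary: Proposes the GaMi system, which uses a cross-modal subtractive disentanglement framework over mmWave and acoustic sensing to achieve high-accuracy material identification under unconstrained geometric conditions.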

Comments 17 pages, 18 figures

Abstract

Non-contact material identification enables adaptive interaction for embodied intelligence yet faces challenges from geometry-induced variations (e.g., orientation, shape, distance) and single-modality ambiguities. In this paper, we present GaMi, a multimodal material identification system integrating mmWave and acoustic sensing to robustly operate under unconstrained geometric conditions. By leveraging the insight of shared geometric consistency between co-located bimodal sensors, GaMi employs an intra-sample cross-modal subtractive disentanglement framework. By semantically aligning modalities and subtracting the shared geometric context, it isolates intrinsic material features. Furthermore, GaMi incorporates inter-sample contrastive learning to correct the residual interference caused by cross-modal misalignment. Additionally, a pairing-based adaptation strategy between two modalities enables few-shot generalization across devices. Extensive evaluations on 20 materials show that GaMi achieves 95.2% accuracy, outperforming single-modality baselines across unseen geometric conditions.

2605.30614 2026-06-01 cs.CR cs.SD Updated version

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

Lingfeng Yao, Xincong Zhong, Chenpei Huang, Xuandong Zhao, Hanqing Guo, Aohan Li, Jiang Liu, Tomoaki Ohtsuki, Miao Pan

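Affiliations: University of Houston; Waseda University; University of Hawaii at Mānoa; Keio University; The University of Electro-Communications

AI summary: Proposes DiffErase, a black-box watermark removal attack that perturbs watermarked audio to an intermediate diffusion noise level and regenerates it with a pretrained denoising model, removing watermarks while preserving perceptual quality.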

Abstract

With the rise of AI-generated audio, watermarking has become widely used for detecting misuse and protecting intellectual property. However, adversaries may try to remove these watermarks, making it critical to evaluate how well watermarking schemes withstand removal attacks. Existing attacks are often impractical: they either noticeably degrade perceptual quality or require access to the watermarking scheme. We propose DiffErase, a black-box watermark removal attack that assumes no knowledge of the target watermarking scheme while maintaining perceptual quality. DiffErase perturbs watermarked audio to an intermediate diffusion noise level and regenerates it using a pretrained denoising model, effectively suppressing watermark signals. Theoretical analysis and extensive experiments demonstrate that inaudible audio watermarks are highly vulnerable: across multiple audio domains, DiffErase consistently removes watermarks while preserving perceptual quality. These findings highlight the need for future audio watermarking designs to consider diffusion-based threats. Code and demos are available at https://differase.github.io/DiffErase/.

2605.30469 2026-06-01 cs.SD cs.CV Updated version

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

Jialu Xu, Yifan Zhou

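Affiliations: University of Waterloo

AI summary: Proposes a full-reference diagnostic framework, the 3DAE Map, which supports visual inspection through time-frequency audio error maps (magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures), and builds a model-agnostic benchmark, 3DAE Bench, for assessing the binaural prediction quality of audio novel-view synthesis models.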

Abstract

3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We frame these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary ground-truth and predicted binaural pairs and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps for the development and optimization of audio novel-view synthesis models.
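
Two of the error maps listed above, interaural level difference (ILD) and interaural phase difference (IPD), can be computed per time-frequency bin from a binaural pair. The sketch below is a generic illustration and not the 3DAE implementation: the STFT parameters, the `eps` floor, and the function names are assumptions.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal Hann-windowed STFT; returns an array of shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def ild_ipd(left, right, eps=1e-8):
    """Per-bin ILD in dB and IPD in radians for one binaural pair."""
    spec_l, spec_r = stft(left), stft(right)
    ild = 20.0 * np.log10((np.abs(spec_l) + eps) / (np.abs(spec_r) + eps))
    ipd = np.angle(spec_l * np.conj(spec_r))
    return ild, ipd

def error_maps(ref_lr, pred_lr):
    """Absolute ILD error (dB) and wrapped absolute IPD error (radians)."""
    ild_ref, ipd_ref = ild_ipd(*ref_lr)
    ild_pred, ipd_pred = ild_ipd(*pred_lr)
    ild_err = np.abs(ild_pred - ild_ref)
    # Wrap the phase difference back into (-pi, pi] before taking its magnitude.
    ipd_err = np.abs(np.angle(np.exp(1j * (ipd_pred - ipd_ref))))
    return ild_err, ipd_err

sample_rate = 16000
t = np.arange(4096) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
reference = (tone, 0.5 * tone)     # right channel about 6 dB quieter
predicted = (tone, 0.25 * tone)    # prediction makes it about 12 dB quieter
ild_err, ipd_err = error_maps(reference, predicted)
peak_bin = int(np.argmax(np.abs(stft(tone)).mean(axis=1)))
```

In this toy case the ILD error at the bin carrying the tone is close to 6 dB while the IPD error is zero, so the map localizes the failure to level rather than phase. That is the kind of distinction a single global metric would not expose.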

2605.30366 2026-06-01 cs.CR cs.SD eess.AS Updated version

Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection

Yifan Liao, Yule Liu, Zhen Sun, Zongmin Zhang, Yupeng He, Jiaheng Wei, Xinhu Zheng, Xinlei He

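Affiliations: Wuhan University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Attributes the failure of adversarial attacks against self-supervised learning (SSL) based singing voice deepfake detection (SVDD) to the Linearity Trap, and proposes the MARS framework, which escapes the trap through bi-level optimization and substantially improves black-box transfer attack success rates.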

Abstract

Recent Singing Voice Synthesis (SVS) advances enable highly realistic but potentially malicious AI covers, making singing voice deepfake detection (SVDD) crucial. Self-Supervised Learning (SSL)-based detectors achieve state-of-the-art performance by fine-tuning speech SSL backbones to capture singing-specific spoof artifacts. Existing adversarial attacks often fail against SSL-SVDD, creating a false impression of inherent robustness. We reveal this stems from two challenges. First, at the objective level, attacks optimize cross-entropy on local surrogates, crossing surrogate-specific boundaries rather than suppressing shared spoof evidence. Second, at the method level, attacks follow the surrogate's dominant gradient direction. In SSL-SVDD, this aligns with fine-tuned artifact-sensitive directions, limiting transferability to unseen detectors - a geometric failure we term the Linearity Trap. To properly evaluate robustness, we propose MARS (Meta-Adversarial Regression of Semantics), a transfer-based black-box framework tailored to SSL-SVDD. Structurally, MARS shifts to hypothesis-evidence manipulation by constructing a natural semantic anchor from the pre-trained SSL space and an artifact anchor from the fine-tuned space. Algorithmically, MARS escapes the Linearity Trap via bi-level optimization: the inner stage induces tangential exploration, while the outer stage guides the audio toward the natural semantic manifold. Experiments on the CtrSVDD benchmark show MARS improves Attack Success Rate (ASR) in in-distribution transfer (13%), out-of-distribution transfer (10%), and cross-task evaluation (36%), highlighting the urgent need for robust SVDD systems.

2605.30365 2026-06-01 cs.SD cs.AI eess.AS Updated version

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

Yizhu Wen, Shuhao Zhang, Nan Zhang, Long Cheng, Hanqing Guo

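Affiliations: Clemson University; Michigan State University

AI summary: Proposes a dual-layer caption poisoning strategy that injects a small number of malicious captions into a music knowledge base, causing retrieval-augmented text-to-music systems to generate music that deviates from user intent and exposing an integrity risk in such systems.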

Comments This paper was accepted by the S&P 2026 ArtSec Workshop

Abstract

Retrieval-augmented text-to-music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset. This design introduces an integrity dependency on the music knowledge database. We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent. In a MusicCaps knowledge database, CLAP retriever, and MusicGen pipeline, poisoned generations move substantially closer to the attacker's target, while remaining comparably aligned with the original user query. These results expose a practical integrity risk for retrieval-augmented creative AI systems. Our demo can be found at: https://yizhu-wen.github.io/Mental-Damage/

2605.24863 2026-06-01 eess.AS cs.SD Updated version

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

Yang Xiao, Siyi Wang, Eun-Jung Holden, Ting Dang

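Affiliations: University of Melbourne, Melbourne, Australia

AI summary: Revisits continual learning for speech from a representation-centric perspective, proposes a new taxonomy based on how representation geometry evolves, identifies key mismatches between existing assumptions and the behavior of speech foundation models, and outlines open challenges and future directions.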

Comments 4 pages, 1 figure, work in progress

Abstract

Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.

2603.10468 2026-06-01 eess.AS cs.AI cs.HC cs.MM cs.SD Updated version

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang

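Affiliations: Nanjing University; Shanghai Jiao Tong University; Central Media Technology Institute, Huawei; Shenzhen Research Institute of Big Data; ETH Zürich

AI summary: Proposes the G-STAR framework, which couples a cache-conditioned speaker-tracking module with a Speech-LLM transcription backbone for end-to-end speaker-attributed recognition of long-form, overlapping multi-speaker speech, supports component-wise optimization and joint training, and performs strongly in both local and global evaluation.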

Comments Submitted to EMNLP 2026

Details

Abstract

We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Prior Speech-LLM systems tend to prioritize either local diarization or global labeling, lacking the ability to jointly model fine-grained temporal boundaries and robust cross-chunk identity linking. We propose G-STAR, an end-to-end framework that couples a cache-conditioned speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Under chunk-wise decoding protocols, experiments on both oracle-segmented local evaluation and full-meeting global evaluation show strong speaker-attributed transcription performance.

2603.07551 2026-06-01 cs.SD cs.AI Updated version

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

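Affiliations * Thomas Lord Department of Computer Science, University of Southern California, USA; Signal Analysis and Interpretation Lab, University of Southern California, USA

AI summary To address the privacy risks of zero-shot TTS voice cloning, proposes the Speech Generation Speaker Poisoning (SGSP) task, which modifies trained models to prevent the generation of specific identities, and evaluates the privacy-utility trade-off of inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers.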

Comments Submitted to Interspeech 2026

Details

Abstract

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
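The abstract above uses word error rate (WER) as its utility metric. As a rough illustration only, the sketch below computes WER in its standard form: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the abstract does not say which ASR system or text normalization the authors use, so whitespace tokenization here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words, so the function returns 2/6. In a study like the one summarized here, the hypothesis would presumably come from transcribing synthesized speech, but that pipeline is not described in the abstract.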

2602.16305 2026-06-01 cs.SD cs.LG Updated version

BAT: Better Audio Transformer Guided by Convex Gated Probing

Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos

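Affiliations * Ghent University; University of Kassel

AI summary Proposes Convex Gated Probing (CGP), which uses a gating mechanism to exploit all frozen layers and narrows the gap between probing and fine-tuning in audio self-supervised learning. Guided by CGP, the authors rework the SSL pipeline to build Better Audio Transformer (BAT), which sets new state-of-the-art results on audio benchmarks.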

Comments Accepted @ ICML26

Details

Abstract

Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as finetuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on finetuning because simple probing fails to unlock their full potential and alters their rankings when competing on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that significantly closes the gap between finetuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP as a reliable post-hoc evaluation probe, we rework the entire SSL pipeline of current best performing audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pretraining recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.

2601.13704 2026-06-01 cs.SD cs.AI cs.LG eess.AS Updated version

Performance and Complexity Trade-off Optimization of Speech Models During Training

Esteban Gómez, Tom Backström

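Affiliations * Department of Information and Communications Engineering, Aalto University

AI summary Proposes a reparameterization technique based on feature noise injection that lets SGD-based methods jointly optimize the performance and computational complexity of speech models during training, so that model size can be adjusted dynamically.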

Comments This work has been submitted to the IEEE for possible publication

Details

Abstract

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.

2509.24901 2026-06-01 cs.SD cs.LG Updated version

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz

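Affiliations * University of Kassel; Fraunhofer IEE; Ghent University; ML and Systems Biology, MPI of Biochemistry; INRIA Montpellier

AI summary To address the poor linear-probing performance of self-supervised audio models, proposes binarized prototypical probes, which learn prototypes for class-wise information aggregation. Across 13 datasets the method outperforms linear and attentive probing, establishing probing as a viable and efficient evaluation paradigm.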

Comments Accepted @ ICLR26

Details

Abstract

Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art on AudioSet. A key reason is that global pooling creates an information bottleneck causing linear probes to misrepresent the embedding quality: the $\texttt{cls}$-token discards crucial token information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.

2510.20853 2026-06-01 eess.AS cs.CL cs.SD Updated version

Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization

Hyungjun Yoon, Seungjoo Lee, Yu Yvonne Wu, Xiaomeng Chen, Taiting Lu, Freddy Yifei Liu, Taeckyung Lee, Hyeongheon Cha, Haochen Zhao, Gaoteng Zhao, Dongyao Chen, Cecilia Mascolo, Sung-Ju Lee, Lili Qiu

Affiliations * KAIST; Carnegie Mellon University; University of Cambridge; Shanghai Jiao Tong University; Pennsylvania State University; UCLA; Northwest University; University of Texas at Austin; Microsoft Research

AI summary Proposes Physiology-informed Multi-band Tokenization (PiMT), an earphone-based approach that learns robust representations from unobtrusively collected everyday ExG data via a reconstruction task, enabling general-purpose ExG monitoring across diverse tasks, including the five human senses.

Comments Accepted to ICLR 2026

Details

Abstract

Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset, the first to enable ExG-based analysis across five human senses, together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.

2509.14789 2026-06-01 eess.AS cs.CR cs.SD eess.SP Updated version

Acoustic Simulation Framework for Multi-channel Replay Speech Detection

Michael Neri, Tuomas Virtanen

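Affiliations * Faculty of Information Technology and Communication Sciences

AI summary Proposes an acoustic simulation framework that uses publicly available resources to simulate multi-channel replay speech configurations. The authors train the M-ALRAD detector, extend it with inter-channel phase difference features, and evaluate its generalization on the ReMASC corpus without any real training data.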

Comments Submitted to IEEE MMSP 2026

Details

Abstract

Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can enhance replay detection robustness, existing datasets and methods predominantly rely on single-channel recordings. Moreover, previous studies highlighted that generalization of this attack to new environments is challenging, requiring new methods for generating data encompassing various acoustic conditions. Hence, in this work we introduce an acoustic simulation framework designed to simulate multi-channel replay speech configurations using publicly available resources. Using the framework, we train the state-of-the-art multi-channel replay detector M-ALRAD and evaluate its generalisation on the ReMASC real-recording corpus without any real training data. To improve the exploitation of spatial information, we extend M-ALRAD with inter-channel phase difference features computed for adjacent microphone pairs, augmenting the beamformed representation with directional cues. Synthetic datasets will be available upon acceptance of the paper.
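The abstract above mentions inter-channel phase difference (IPD) features computed for adjacent microphone pairs. As a minimal sketch of the general idea, and not the authors' implementation, the code below takes one STFT frame per channel and returns the wrapped phase difference for each adjacent pair. The function name `adjacent_ipd`, the input layout, and the cosine/sine encoding are all assumptions made for illustration.

```python
import cmath
import math


def adjacent_ipd(stft_bins):
    """Compute IPD features for adjacent channel pairs in a single STFT frame.

    stft_bins[c][k] is the complex STFT value of channel c at frequency bin k
    (hypothetical layout). Returns features[p][k] = (cos(ipd), sin(ipd)) for
    the pair (p, p + 1).
    """
    if len(stft_bins) < 2:
        raise ValueError("need at least two channels")

    features = []
    for left, right in zip(stft_bins, stft_bins[1:]):
        pair = []
        for x_left, x_right in zip(left, right):
            # The phase of the cross-spectrum is the phase difference between
            # the two channels, wrapped to [-pi, pi].
            ipd = cmath.phase(x_left * x_right.conjugate())
            # Assumption: encode as (cos, sin) to avoid the wrap-around jump.
            pair.append((math.cos(ipd), math.sin(ipd)))
        features.append(pair)
    return features
```

With three channels this yields two pairs, (0, 1) and (1, 2). How such features are combined with the beamformed representation in M-ALRAD is not specified in the abstract.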