arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.27346 2026-05-27 cs.SD updated version

MERIT: Learning Disentangled Music Representations for Audio Similarity

Abhinaba Roy, Junyi Liang, Dorien Herremans

AI summary: To address the way existing music similarity models entangle dimensions such as melody, rhythm, and timbre, the paper proposes MERIT, a framework that learns disentangled, factor-specific representations through a training strategy built on conditional audio generation and source-separated stems, so that each dimension responds independently.

Abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
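As a rough illustration of the "nuanced queries" that factor-specific heads would allow, the sketch below ranks a small library by similarity on one factor at a time. It assumes each track has already been embedded by three heads; the toy 2-D vectors, the track names, and the helper `rank_by_factor` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_factor(query, library, factor):
    """Rank library tracks by similarity to the query on a single factor."""
    scores = {name: cosine(query[factor], emb[factor]) for name, emb in library.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-factor embeddings (toy values for illustration only).
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
library = {
    "track_a": {"melody": [0.9, 0.1], "rhythm": [1.0, 0.0], "timbre": [0.0, 1.0]},
    "track_b": {"melody": [0.0, 1.0], "rhythm": [0.1, 0.9], "timbre": [1.0, 0.0]},
}

print(rank_by_factor(query, library, "melody"))  # track_a ranks first on melody
print(rank_by_factor(query, library, "rhythm"))  # track_b ranks first on rhythm
```

The same query returns different neighbours depending on the factor chosen, which a single monolithic score cannot express.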

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD updated version

Learning When to Think While Listening in Large Audio-Language Models

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

Affiliations: University of Pennsylvania

AI summary: Proposes a learnable wait-think-answer control mechanism that uses multi-reward reinforcement learning to optimize when a large audio-language model reasons during streaming spoken interaction, improving accuracy while reducing response latency.

Comments: 19 pages, 4 figures, 6 tables

Abstract

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

2605.27189 2026-05-27 cs.CL cs.LG cs.SD eess.AS q-bio.NC updated version

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter

Affiliations: 1 Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany; 2 Tübingen AI Center, University of Tübingen, Tübingen, Germany; 3 Department of Psychology, Humboldt-Universität zu Berlin; 4 Geriatric Center, Tübingen University Hospital, Tübingen, Germany; 5 Tübingen Center for Mental Health (TüCMH), Department of Psychiatry and Psychotherapy, Tübingen University Hospital, Tübingen, Germany; 6 German Center for Mental Health (DZPG), Partner Site Tübingen, Tübingen, Germany; 7 Department of Neurology, University Medical Center Schleswig-Holstein Kiel University, Kiel, Germany; 8 Center for Neurology, University Hospital Tübingen Hertie Institute for Clinical Brain Research, Tübingen, Germany; 9 Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; 10 Charité - Universitätsmedizin, Department of Psychiatry

AI summary: Using 5,754 German neuropsychological assessment recordings, the study compares hand-crafted acoustic features with self-supervised learning embeddings across the cognitive assessment hierarchy (task, domain, global) in mild cognitive impairment, and finds a link between task constraints and assessment level.

Abstract

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting "specialist" representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting "generalist" representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.

2605.27174 2026-05-27 cs.SD cs.AI cs.CY updated version

An investigation of AI integration in sound designer workflows and experiences

Nelly Garcia, Joshua Reiss

Affiliations: Queen Mary University of London

AI summary: A mixed-methods study (a survey of 76 practitioners plus 20 interviews) finds that current AI tools perform adequately in fast-consumption media but lack the narrative sophistication that high-end sound design requires; practitioners prefer assistive, task-specific applications over end-to-end generative systems.

Abstract

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the ongoing discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers, based on our findings, to guide the development of more informed AI tools for sound design.

2605.27039 2026-05-27 eess.AS cs.SD updated version

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

Affiliations: The University of Melbourne; KAIST; The University of Auckland; UNSW Sydney

AI summary: By introducing the EnvMem benchmark, the paper finds that the main reason large audio language models fail to retain non-speech information across multi-turn interactions is representational trajectory drift rather than insufficient attention allocation.

Abstract

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

2605.26978 2026-05-27 cs.CL cs.SD updated version

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Hanif Rahman

Affiliations: Independent Researcher

AI summary: To address the shortcomings of relying on a single ASR round-trip WER when evaluating TTS for low-resource non-Latin-script languages, the paper proposes the INSV reporting framework and its automated screening subset INSV-A, instantiated as the PashtoTTS-Bench benchmark, which evaluates several TTS systems on multiple metrics.

Abstract

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
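For readers unfamiliar with the screening metrics, the sketch below computes WER and CER from a standard Levenshtein edit distance, plus a simple script check. The WER and CER definitions are the usual ones; `script_fidelity` is only a hypothetical stand-in (share of letters whose Unicode name contains "ARABIC"), since the abstract does not define the paper's Script Fidelity Rate.

```python
import unicodedata

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)

def script_fidelity(text):
    """Hypothetical proxy: fraction of letters that are Arabic-script characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum("ARABIC" in unicodedata.name(ch, "") for ch in letters)
    return arabic / len(letters)

print(wer("a b c d", "a x c"))   # one substitution and one deletion over four words
print(script_fidelity("سلام"))   # all letters are Arabic-script
```

A low WER alone would not reveal a transcript written in the wrong script, which is why a separate script check is useful in this setting.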

2605.25930 2026-05-27 cs.SD updated version

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

Affiliations: College of Computer Science, Nankai University; College of Artificial Intelligence, Nankai University

AI summary: Proposes CosyEdit2, which uses a two-stage post-training framework (supervised editing initialization followed by editing-oriented GRPO over target-speech-free data) to address the local acoustic consistency that speech editing demands; it substantially improves editing performance and also strengthens zero-shot TTS.

Abstract

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
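The core idea of GRPO, scoring each sampled output relative to the other samples drawn for the same prompt, can be sketched in a few lines. This is a generic illustration of group-relative advantage normalization, not CosyEdit2's training code; the reward values are made up, and the use of the population standard deviation is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four edited utterances sampled from one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # above-average samples get positive advantages
```

Because the baseline is the group mean, no separate value network is needed, and no ground-truth target speech is required as long as a reward can be computed on the generated audio.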

2605.26244 2026-05-27 cs.CV cs.MM cs.SD updated version

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang

Affiliations: Peking University; Kling Team; Nanjing University; SJTU; HKUST(GZ); Shanghai AI Lab; Nanyang Technological University; CASIA; Tsinghua University

AI summary: To address evaluation protocols that are confined to short clips, the paper proposes the LongAV-Compass benchmark, which uses 284 test cases and a unified evaluation framework to systematically assess the quality, consistency, and alignment of minute-scale audio-visual generation under text, image, and video conditioning.

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.

2605.26176 2026-05-27 cs.SD cs.AI updated version

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith, David M. Chan, Karina Nguyen

Affiliations: University of California, Berkeley; Thoughtful Lab

AI summary: Proposes the PitchBench evaluation suite, which uses 28 experiments to systematically measure absolute and relative pitch perception in audio-language models, and finds that current models perceive pitch unreliably across sound sources, note durations, and formats.

Comments: Preprint

Abstract

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
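To give a sense of what a controlled pitch stimulus and its answer formats look like, the sketch below converts MIDI note numbers to frequencies and note names under 12-tone equal temperament with A4 = 440 Hz, and renders a pure sine tone. It is an independent illustration and does not use the PitchBench package; all function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_hz(midi):
    """Equal-temperament frequency with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((midi - 69) / 12)

def midi_to_name(midi):
    """Scientific pitch notation, e.g. MIDI 60 -> C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def hz_to_midi(freq):
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(freq / 440.0))

def sine_tone(midi, duration_s=0.5, sample_rate=16000):
    """Samples of a pure sine tone at the pitch of the given MIDI note."""
    freq = midi_to_hz(midi)
    n_samples = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n_samples)]

print(midi_to_name(60), round(midi_to_hz(60), 2))  # middle C and its frequency in Hz
```

The same pitch can be requested as a note name, a MIDI number, or a frequency, which is one way the notation format of a response can vary while the stimulus stays fixed.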

2605.26136 2026-05-27 cs.SD cs.AI updated version

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

Nicolas M. Müller, Wei Herng Choong

Affiliations: Fraunhofer AISEC

AI summary: A large-scale listening experiment (1,768 participants, 35,532 judgments) finds that audio deepfakes have lowered human trust in real speech (accuracy on real samples fell from 72.7% to 64.1%) rather than reducing the ability to detect fakes.

Abstract

Audio deepfakes have recently improved rapidly, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3-65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4-76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.
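The skepticism shift can be checked as simple arithmetic on the accuracies reported above. The snippet only restates figures from the abstract; the variable names are illustrative.

```python
# Human accuracy (%) from the abstract: 2021 baseline versus the current study.
baseline = {"fake": 72.9, "real": 72.7}
current = {"fake": 71.2, "real": 64.1}

# Change in percentage points for each sample type.
changes = {kind: round(current[kind] - baseline[kind], 1) for kind in baseline}
print(changes)

# Average number of judgments contributed per participant.
judgments_per_participant = 35532 / 1768
print(round(judgments_per_participant, 1))
```

Accuracy on fake samples moves by 1.7 points while accuracy on real samples falls by 8.6 points, so the change is concentrated almost entirely on genuine speech.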

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP updated version

PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

Affiliations: Department of Computer Science, Sapienza University of Rome, Italy; Moises Systems, Inc.; Paradigma, Inc.

AI summary: Proposes PHALAR, a contrastive framework that uses learned spectral pooling and a complex-valued head to obtain pitch and phase equivariance; on stem retrieval it requires fewer than 50% of the parameters, trains 7 times faster, improves accuracy by up to about 70% in relative terms, and captures robust musical structure.

Comments: Accepted at ICML 2026

Abstract

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to approximately 70% over the state-of-the-art while requiring fewer than 50% of the parameters and training 7× faster. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
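As background on why complex-valued features are a natural fit for phase equivariance, the sketch below verifies the DFT shift theorem: delaying a signal circularly by s samples multiplies each frequency coefficient by a unit phasor and leaves magnitudes unchanged. This is a generic signal-processing fact shown on made-up samples, not a reproduction of PHALAR's layers.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_shift(x, s):
    """Delay x by s samples with wrap-around."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]

x = [0.0, 1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.3]  # arbitrary toy signal
s = 3
n = len(x)

X = dft(x)
Y = dft(circular_shift(x, s))

# Shift theorem: Y[k] = X[k] * exp(-2*pi*i*k*s/n), a pure rotation per frequency bin.
predicted = [X[k] * cmath.exp(-2j * cmath.pi * k * s / n) for k in range(n)]
max_err = max(abs(a - b) for a, b in zip(Y, predicted))
print(max_err < 1e-9)
```

A representation that keeps the complex coefficients preserves timing as phase, whereas one that keeps only magnitudes discards it.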

2601.18904 2026-05-27 cs.SD cs.AI cs.CL updated version

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

Affiliations: University of Illinois Urbana Champaign; Tsinghua University

AI summary: Proposes MetaSICL, a method that uses high-resource speech data and meta-learning to strengthen the in-context learning ability of auditory large language models, outperforming direct fine-tuning in low-resource scenarios.

Abstract

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. When in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that vanilla ICL improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability generalizes to the multimodal setting. Building on this, we propose Meta Speech In-Context Learning (MetaSICL), a post-training recipe that uses only high-resource speech data from various tasks to strengthen the model's in-context learning capability. Experiments indicate that our proposed method outperforms direct fine-tuning in low-resource scenarios.

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG updated version

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Affiliations: School of Electrical and Computer Engineering, University of Tehran; Institute for Research in Fundamental Sciences (IPM)

AI summary: Introduces ParsVoice, currently the largest publicly available Persian speech-text corpus, built from long-form audiobooks with a scalable pipeline to produce high-quality data for training multi-speaker TTS systems, and validates its effectiveness for zero-shot multi-speaker TTS.

Abstract

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
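The headline corpus figures imply a rough average segment length and average number of segments per speaker. The calculation below uses only the numbers quoted in the abstract; because "1.36 million" and "2,200 hours" are rounded, the results are approximate, and the actual per-speaker distribution is not stated and may be far from uniform.

```python
# Figures quoted in the abstract (rounded in the source).
hours = 2200
segments = 1_360_000
speakers = 1815

avg_segment_seconds = hours * 3600 / segments
avg_segments_per_speaker = segments / speakers
avg_hours_per_speaker = hours / speakers

print(round(avg_segment_seconds, 2))      # seconds of audio per segment, on average
print(round(avg_segments_per_speaker))    # segments per speaker ID, on average
print(round(avg_hours_per_speaker, 2))    # hours per speaker ID, on average
```

An average segment of just under six seconds is in the range commonly used for sentence-level TTS training clips.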

2506.10225 2026-05-27 cs.SD cs.AI eess.AS updated version

Genre Controlled Music Generation via Activation Steering

Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

Affiliations: Indian Institute of Technology Madras

AI summary: Proposes an inference-time intervention on the autoregressive generative model MusicGen that steers the residual stream with the weights of a linear probe, enabling fine-grained genre control.

Abstract

Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending of diverse music elements. In this work, we present a method for fine-grained control using inference-time interventions on an autoregressive generative transformer, MusicGen. Through our approach, we achieve genre control by steering the residual stream using weights of a linear probe on it. By framing activation steering as a human-controllable interaction, our work highlights how interpretable model behaviors can empower co-creative music generation. Audio samples demonstrating our method are available on our demo page.
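A minimal sketch of steering with probe weights follows, assuming the simplest additive form: the hidden state is shifted along the normalized weight vector of a linear genre probe, scaled by a strength `alpha`. The vectors, the probe, and the functions `steer` and `probe_logit` are hypothetical; the abstract does not specify the exact update rule, the layer, or how MusicGen's activations are accessed.

```python
import math

def probe_logit(hidden, probe_weights, bias=0.0):
    """Score of a linear probe on a residual-stream vector."""
    return sum(h * w for h, w in zip(hidden, probe_weights)) + bias

def steer(hidden, probe_weights, alpha):
    """Shift the hidden state by alpha along the unit-norm probe direction."""
    norm = math.sqrt(sum(w * w for w in probe_weights))
    direction = [w / norm for w in probe_weights]
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy residual-stream vector and toy probe weights for one genre.
hidden = [0.2, -0.1, 0.4]
genre_probe = [3.0, 0.0, 4.0]

steered = steer(hidden, genre_probe, alpha=2.0)
print(probe_logit(hidden, genre_probe))   # probe score before steering
print(probe_logit(steered, genre_probe))  # probe score after steering
```

Under this form the probe score rises by alpha times the norm of the probe weights, so `alpha` acts as a single human-adjustable control for how strongly the genre is pushed.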