arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.21433 2026-05-21 cs.SD version update

Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches

Junyoung Koh

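Affiliations Department of Artificial Intelligence; Yonsei University; MAAP KRAFTON; Seoul, Republic of Korea

AI summary This paper asks which text-to-music design choices remain effective when training data and external pretraining are controlled. Models retrained without the auxiliary lyric and timbre branches score lower on several evaluation metrics, and adding DiT depth recovers only a small part of the loss, suggesting the auxiliary branches may act as training-time architectural anchors.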

Comments ICME 2026 Grand Challenge on Academic Text-to-Music Generation

Abstract

Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human MOS, and that reinvesting the saved parameters as additional DiT depth recovers only marginally. This suggests the auxiliary branches may act as training-time architectural anchors whose contribution goes beyond their explicit conditioning content. We validate the same model through comparisons with external instrumental baselines and through our submission to the ICME 2026 Academic Text-to-Music (ATTM) Grand Challenge, where our Performance submission ranked first under both the objective metrics and the subsequent organizer-administered MOS over 35 raters, attaining the highest overall MOS across all challenge submissions, while our Efficiency submission was a finalist that tied for second under the objective metrics.

2601.06006 2026-05-21 eess.AS cs.SD version update

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng, Beilong Tang, Wang Xiang, Ming Li

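Affiliations School of Computer Science, Wuhan University; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Digital Innovation Research Center, Duke Kunshan University; North Carolina State University

AI summary This paper proposes a two-stage discriminative-generative framework that combines the controllability of discriminative extraction with the reconstruction capability of generative modeling, aiming to improve perceptual quality, intelligibility, and speaker consistency in target speaker extraction and speech enhancement.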

Comments 13 pages, 4 figures

Abstract

Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.
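The SI-SDR regularization named above is not spelled out in the abstract. As a minimal sketch, assuming the standard scale-invariant signal-to-distortion ratio definition, the metric could be computed as follows. The function name `si_sdr` and the `eps` guard are illustrative choices, and how the authors weight this term during training is not stated.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r) + eps
    alpha = dot / ref_energy
    target = [alpha * b for b in r]
    # Everything in the estimate that is not the scaled target counts as distortion.
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is why the measure is called scale-invariant. A regularizer would typically use the negative of this value as a loss term, though that usage is an assumption here.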

2605.21081 2026-05-21 cs.SD cs.LG version update

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

Shinnosuke Taksuka, Hideo Mukai

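Affiliations Department of Computer Science, School of Science and Technology, Meiji University

AI summary This paper proposes Musical Attention, a music-specific attention mechanism that integrates musical structure and metadata into the Transformer attention process, and reports improved coherence and variation in the generated music.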

Comments 32 pages, 13 figures

Abstract

This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events (pitch, bar number, onset, duration, and velocity), in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

2605.20920 2026-05-21 cs.CL cs.SD version update

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

Vinicius Ribeiro, Yves Laprie

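Affiliations Université de Lorraine, CNRS, Inria, LORIA

AI summary This paper evaluates speech articulation synthesis using articulatory phoneme recognition as a proxy, on the hypothesis that phoneme recognition from articulatory features captures pronunciation details more accurately than point-wise distance metrics, with the aim of improving how generative models are evaluated.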

Comments Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

Abstract

Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. Articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps explore additional dimensions of speech articulation synthesis.

2605.20853 2026-05-21 cs.SD eess.AS version update

SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

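Affiliations Faculty of Computer Science and Information Technology, Universiti Malaya; Faculty of Electrical Engineering, Universiti Teknologi Malaysia

AI summary This paper introduces SEABAD, a dataset that addresses the species richness and acoustic complexity of bird activity detection in tropical regions. It offers balanced bird-present and bird-absent samples in a standardized audio format to support efficient acoustic monitoring and low-power inference.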

Comments 14 pages, 4 figures

Abstract

Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but continuous recording generates large amounts of non-informative audio, creating challenges for storage, power consumption, and long-term edge deployment. Bird audio detection (BAD), which identifies bird vocalizations, can reduce this burden by filtering irrelevant recordings before downstream analysis. However, most BAD systems are trained on temperate datasets despite tropical soundscapes being denser, more species-rich, and acoustically unpredictable. To address this gap, we introduce SEABAD (Southeast Asian Bird Activity Detection), a dataset of 50,000 curated three-second clips from Southeast Asian soundscapes, evenly balanced between bird-present and bird-absent samples. The dataset spans 1,677 bird species and is standardized to 16 kHz mono audio for embedded and low-power inference. We developed a dual-branch curation pipeline: a six-stage positive-label workflow applied to Xeno-Canto recordings, alongside six source-specific negative-label extractions from environmental datasets. These procedures reduced class imbalance by 13.7% (Gini coefficient: 0.601 to 0.519). A manual audit of 1,000 positive clips confirmed 97.8% +/- 0.9% labeling accuracy. Baseline experiments using MobileNetV3-Small achieved 99.57% +/- 0.25% accuracy and 0.9985 +/- 0.0002 AUC across three random seeds. SEABAD and the full curation pipeline are publicly released to support tropical BAD research and energy-efficient acoustic monitoring.
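The class-imbalance figure above is reported as a Gini coefficient. As a rough sketch, assuming the coefficient is taken over per-class clip counts (the abstract does not say exactly which counts are used), it could be computed as below. The function names `gini` and `relative_reduction` are illustrative and not taken from the released pipeline.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 is perfectly even, values near 1 are highly skewed."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before
```

With the rounded endpoints quoted in the abstract, `relative_reduction(0.601, 0.519)` is about 0.136, which agrees with the reported 13.7% up to rounding of the published values.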

2605.11866 2026-05-21 cs.SD version update

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Yiming Ren, Xuenan Xu, Ziyang Zhang, Wen Wu, Baoxiang Li, Chao Zhang

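Affiliations Shanghai Artificial Intelligence Laboratory; Tsinghua University

AI summary This paper proposes AuDirector, a self-reflective closed-loop multi-agent framework that targets coherence, emotional expressiveness, and audio fidelity in long-form audio storytelling while supporting user interaction.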

Abstract

Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.

2602.16399 2026-05-21 eess.AS cs.LG cs.SD version update

Multi-Channel Replay Speech Detection using Acoustic Maps

Michael Neri, Tuomas Virtanen

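Affiliations Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland

AI summary This paper proposes acoustic maps as a novel spatial feature representation for replay speech detection in multi-channel recordings. A lightweight convolutional neural network achieves competitive performance on the ReMASC dataset, showing that acoustic maps provide a compact and physically interpretable feature space across devices and acoustic environments.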

Comments Accepted in EUSIPCO 2026

Abstract

Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

2110.06123 2026-05-21 cs.SD eess.AS version update

COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation

Saranga Kingkor Mahanta, Darsh Kaushik, Shubham Jain, Hoang Van Truong, Koushik Guha

Affiliations Electronics and Communication Engineering Department, National Institute of Technology, Silchar, India 788010; Computer Science and Engineering Department, National Institute of Technology, Silchar, India 788010; Software Engineering Department, Paytm, Noida, India 110096; Mathematics in Computer Science, University of Science of HCMC, Ho Chi Minh City, Vietnam 700000

AI summary This paper applies a convolutional neural network with data augmentation to the cough sound dataset from Track 1 of the DiCOVA 2021 Challenge to diagnose COVID-19. The augmented model reaches an AUC of 87.07% on the blind test set and outperforms the challenge baseline.

Comments DiCOVA, top 1st, This work has been submitted to the IEEE for possible publication

Journal ref
IEEE Advances in Computing and Future Communication Technologies (ICACFCT), Meerut, India, 2021, pp. 33-38
Abstract

With the periodic rise and fall of COVID-19 and countries being hit by its waves, an efficient, economical, and effortless diagnosis procedure for the virus has been the need of the hour. COVID-19 positive individuals may be asymptomatic, making diagnosis difficult, but among the infected subjects, the asymptomatic ones need not be entirely free of symptoms caused by the virus. They might not show any observable symptoms like the symptomatic subjects, but they may differ from uninfected ones in the way they cough. These differences in the coughing sounds are minute and indiscernible to the human ear; however, they can be captured using machine learning-based statistical models. In this paper, we present a deep learning approach to analyze the acoustic dataset provided in Track 1 of the DiCOVA 2021 Challenge, containing cough sound recordings belonging to both COVID-19 positive and negative examples. To classify the sound recordings as COVID-19 positive or negative, we propose a ConvNet model. Our model achieved an AUC score of 72.23% on the blind test set provided by the challenge for an unbiased evaluation of the models. The ConvNet model combined with data augmentation further increased the AUC-ROC from 72.23% to 87.07%. It also outperformed the DiCOVA 2021 Challenge's baseline model by 23%, thus claiming the top position on the DiCOVA 2021 Challenge leaderboard. This paper proposes the use of Mel frequency cepstral coefficients as the feature input for the proposed model.

2605.20386 2026-05-21 cs.MM cs.CY cs.HC cs.SD version update

Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching

Ling Qi, Aleksandra Teng Ma, Alexandria Smith

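AI summary This paper proposes a culturally situated approach to the I-Ching: an interactive system that treats the I-Ching as a meaning-bearing framework. Coin casting, accompanied in real time by probabilistic musical processes, yields hexagrams and changing lines that the large language model Gemini interprets. The interpretation is then turned into a prompt for the generative music model Lyria, which produces a responsive musical realization.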

Comments Published and presented at the International Computer Music Conference (ICMC) 2026

Abstract

The I-Ching is one of the most influential texts in Chinese intellectual history, integrating divination, cosmology, and ethical reflection. While Western experimental music, most notably John Cage, has drawn on the I-Ching as a source of chance operation, such appropriations have often detached its formal mechanisms from the interpretive and philosophical processes that give the text meaning. This work, Music of Changing Lines, presents an interactive system that re-centers the I-Ching as a meaning-bearing framework rather than a neutral randomizer. Users perform Wen Wang Fa coin casting, which is accompanied in real time through probabilistic musical processes. The resulting hexagrams and changing lines are interpreted by a large language model, Gemini, in relation to the user's inquiry. This textual interpretation is then translated into a prompt for a generative music model, Lyria, producing a responsive musical realization. By situating AI as an interpretive intermediary rather than a compositional authority, the system foregrounds the I-Ching's ritual, interpretation, and participation as the primary sonic materials. Music of Changing Lines extends process-driven traditions in computer music by demonstrating how generative AI can support participatory, meaning-driven musical processes without prescribing musical structure or replacing human agency.
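The coin casting step described above can be illustrated with a small sketch. It assumes the common three-coin convention (heads counts 3, tails counts 2, so each line totals 6, 7, 8, or 9, with 6 and 9 marking changing lines); the abstract does not state which values the system uses, and the names `cast_line` and `cast_hexagram` are hypothetical.

```python
import random

LINE_NAMES = {
    6: "old yin (changing)",
    7: "young yang",
    8: "young yin",
    9: "old yang (changing)",
}


def cast_line(rng):
    """Toss three coins; each contributes 2 (tails) or 3 (heads)."""
    return sum(rng.choice((2, 3)) for _ in range(3))


def cast_hexagram(rng=None):
    """Return (values, primary, changing, transformed), lines ordered bottom to top."""
    rng = rng or random.Random()
    values = [cast_line(rng) for _ in range(6)]
    # 1 marks a yang (solid) line, 0 a yin (broken) line.
    primary = [1 if v in (7, 9) else 0 for v in values]
    # 1-based positions of the changing lines, counted from the bottom.
    changing = [i + 1 for i, v in enumerate(values) if v in (6, 9)]
    # Changing lines flip polarity in the transformed hexagram.
    transformed = [1 - b if v in (6, 9) else b for b, v in zip(primary, values)]
    return values, primary, changing, transformed
```

Under this convention a 6 or a 9 each occurs with probability 1/8 and a 7 or an 8 with probability 3/8, so changing lines are comparatively rare. The primary hexagram, the changing positions, and the transformed hexagram are the kind of result that the system would then hand to the language model for interpretation.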

2605.20356 2026-05-21 cs.CL cs.AI cs.SD version update

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

Pablo Riera, Pablo Brusco, Cristina Kuo, Marcelo Sancinetti, S. R. K. Branavan

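Affiliations ASAPP Inc.; Departamento de Computación, FCEyN, Universidad de Buenos Aires

AI summary This paper studies how full-duplex spoken dialogue models coordinate their internal representations during interaction. Representational synchronization degrades as channel noise increases, and internal states encode anticipatory turn-taking information.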

Abstract

Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained Moshi model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under noise-free conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.
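To make the lagged synchronization measure concrete, here is a minimal sketch of CKA evaluated at a temporal lag. It assumes the linear-kernel variant of CKA and two activation sequences of equal length stored as lists of feature vectors; the abstract does not state which kernel or layers are used, and the names `linear_cka` and `lagged_cka` are illustrative.

```python
import math


def _center(rows):
    """Subtract the per-feature mean from an n x d row-major matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _cross_sq_frobenius(a, b):
    """Squared Frobenius norm of a^T b for a (n x p) and b (n x q)."""
    p, q = len(a[0]), len(b[0])
    total = 0.0
    for i in range(p):
        for j in range(q):
            s = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of activations with matching rows (time steps)."""
    x, y = _center(x), _center(y)
    num = _cross_sq_frobenius(y, x)
    den = math.sqrt(_cross_sq_frobenius(x, x) * _cross_sq_frobenius(y, y))
    return num / den if den > 0 else 0.0


def lagged_cka(x, y, lag):
    """CKA between x[t] and y[t + lag]; a negative lag shifts y backward in time."""
    if lag >= 0:
        xs, ys = x[: len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[: len(y) + lag]
    return linear_cka(xs, ys)
```

Sweeping `lag` over a range of negative and positive values and plotting the result gives a synchronization curve; a peak near zero lag would correspond to the pattern the abstract reports for the noise-free condition.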

2605.20266 2026-05-21 cs.SD version update

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Kaiwen Luo, Zhenhong Zhou, Leo Wang, Liang Lin, Yang Xiao, Tianyu Shao, Yuanhe Zhang, Yuxuan Li, Miao Yu, Kailin Lyu, Jiaming Zhang, Dongrui Liu, Li Sun, Yueming Wu, Kai Li, Ting Dang, Xiaojun Jia, Rohan Kumar Das, Xinfeng Li, Siyuan Liang, Qiufeng Wang, Xingjun Ma, Jing Chen, Kun Wang, Junhao Dong, Deqing Zou, Yu Cheng, Xia Hu, Zhigang Zeng, Sen Su, Yang Liu, Yu-Gang Jiang, Philip S. Yu, Yew-Soon Ong

Affiliations Nanyang Technological University; Independent Researcher; The University of Melbourne; North China Electric Power University; Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences; University of Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; Shanghai AI Laboratory; Huazhong University of Science and Technology; Tsinghua University; Fortemedia Singapore; Tencent; Fudan University; Wuhan University; Chinese University of Hong Kong; Chongqing University of Posts and Telecommunications; University of Illinois Chicago

AI summary This survey reviews the generalization and trustworthiness of large audio language models and their outlook. It covers architectural innovations, alignment algorithms, and security risks, and proposes directions such as Defense-in-Depth architectures and causal auditory world modeling to make audio intelligence more trustworthy.

Abstract

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

2605.20220 2026-05-21 cs.SD cs.IR cs.LG version update

Advanced Scientific Methodology Plays Rossini

Silvia Licciardi, Daniela Macchione, Emmanuel Caronna, Elisa Francomano

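Affiliations University of Palermo, Department of Engineering; Conservatory Alfredo Casella

AI summary This paper applies computational analysis to the structure of one of Rossini's settings of the Metastasio arietta "Mi lagnerò tacendo", examining its melodic, harmonic, and textual compositional choices and laying a foundation for systematic study in music philology.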

Abstract

A musical score provides the essential instructions for its performance while containing indications, at times implicit, regarding the composer's intentions. The presence of authorial variants, and even more so complex series of revisions associated with a single text, presents a challenging path for analytical study. This research, situated within the application of Scientific Methodologies to Music Philology, proposes a methodological approach oriented toward the structural analysis of one of the many settings composed by Gioachino Rossini on the same Metastasio arietta "Mi lagnerò tacendo". Through Computational Analysis, incorporating parsing, data mining, and graph theory, the melodic, harmonic, and textual compositional choices have been rigorously explored. The results constitute a significant unicum in the field, laying the foundation for a systematic study that supports philological research and paves the way for the use of generative models to investigate the creative process.

2603.00086 2026-05-21 cs.CL cs.AI cs.SD eess.AS version update

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

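Affiliations LaTIM UMR 1101 INSERM; University of Western Brittany; University of Rouen Normandy

AI summary This study proposes a multi-pass LLM post-processing architecture that alternates between speaker recognition and word recognition passes to improve transcription accuracy and speaker attribution in French medical conversations. Ablation studies on two French clinical datasets examine four design choices.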

Abstract

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.