arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16181 2026-05-18 cs.SD Updated version

ARIA: A Diagnostic Framework for Music Training Data Attribution

Changheon Han, Ashkan Panahi, Kıvanç Tatar

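Affiliations: Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg

AI summary: ARIA decomposes music attribution by musical aspect and pairs the decomposition with reliability diagnostics, exposing differences in attribution behavior across methods for music generation and supporting more accurate copyright analysis.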

Comments Working Paper

Details
Abstract

Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.
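
The two diagnostics described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a precomputed track-to-track similarity matrix `sim` and a query-by-track score matrix, and it uses mean top-K Jaccard overlap across queries as one possible stand-in for the column statistics. All function names are hypothetical, and the SVD-based diagnostic is omitted.

```python
import random

def top_k(scores, k):
    """Indices of the k highest-scoring training tracks for one query."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def mean_pairwise_similarity(group, sim):
    """Average similarity over all unordered pairs in a group of track indices."""
    pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]
    return sum(sim[a][b] for a, b in pairs) / len(pairs)

def within_group_gap(scores, sim, k, n_ref=200, seed=0):
    """Top-K within-group similarity minus the mean over random reference groups."""
    rng = random.Random(seed)
    pool = list(range(len(scores)))  # assumed: the whole training pool
    top = mean_pairwise_similarity(top_k(scores, k), sim)
    ref = sum(mean_pairwise_similarity(rng.sample(pool, k), sim)
              for _ in range(n_ref)) / n_ref
    return top - ref

def mean_topk_jaccard(score_matrix, k):
    """Mean Jaccard overlap of top-K sets across query pairs.
    Values near 1 suggest the retrieved tracks barely depend on the query."""
    sets = [set(top_k(row, k)) for row in score_matrix]
    vals = [len(a & b) / len(a | b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    return sum(vals) / len(vals)
```

A positive gap means the attributed tracks resemble each other more than random tracks do along the chosen aspect. A Jaccard overlap close to 1 is the kind of query-independent retrieval the abstract says ARIA flags.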

2605.15984 2026-05-18 cs.SD cs.AI cs.CR Updated version

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li, Qinglong Wang, Li Lu

Affiliations: The State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; School of Cyber Science and Engineering

AI summary: This paper presents the ToxiAlert-Bench dataset and a dual-head neural network that incorporate paralinguistic cues to improve speech toxicity detection; experiments show the method outperforms existing baselines on multiple metrics.

Details
Abstract

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources, distinguishing between textual content and paralinguistic origins, for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
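
To make the imbalance handling and the headline metric concrete, here is a small sketch. It assumes inverse-frequency weighting as one common form of weighted loss; the abstract does not state the exact weighting scheme, and the function names are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1/frequency, scaled to mean 1 over classes.
    Assumption: the paper's actual weighting may differ."""
    counts = Counter(labels)
    raw = {c: len(labels) / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Because Macro-F1 averages over classes rather than over clips, a classifier that ignores a rare toxic category is penalized heavily, which is why it is a natural metric for an imbalanced label set.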

2605.15831 2026-05-18 cs.SD cs.AI Updated version

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

Yuqing Cheng, Xingyu Ma, Guochen Yu, Xiaotao Gu

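Affiliations: Department of Music AI and Information Technology, Central Conservatory of Music; Zhipu AI

AI summary: This paper proposes BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that produces Mel-frequency band tokens from a single shared codebook to make autoregressive modeling easier; experiments show strong results in a data-limited setting.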

Details
Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
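
A 2D rotary position embedding can be illustrated as follows. This sketch assumes the common axis-split variant, where half of the feature dimensions are rotated by the time index and the other half by the frequency-band index; the abstract does not say which 2D RoPE variant the authors use, and the function names are hypothetical.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec`; pair p uses angle pos * base**(-2p/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2p, so base ** (-i / d) equals base ** (-2p / d)
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_2d(vec, t, band):
    """Assumed axis-split 2D RoPE: first half encodes time, second half encodes band.
    len(vec) must be divisible by 4."""
    half = len(vec) // 2
    return rope_1d(vec[:half], t) + rope_1d(vec[half:], band)

def dot(a, b):
    """Plain inner product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

The useful property is that the attention logit between a rotated query and key depends only on their relative offset in time and in band, so the same pattern is scored the same way wherever it sits on the time-frequency grid.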

2605.14736 2026-05-18 cs.SD cs.LG Updated version

IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

Dinanath Padhya, Sajen Maharjan, Binita Adhikari, Ishwor Raj Pokharel

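Affiliations: Department of Electronics and Computer Engineering, Thapathali Campus, Institute of Engineering, Tribhuvan University

AI summary: IsoNet combines multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings to extract target speech on a compact microphone array, outperforming conventional beamformers in complex acoustic environments.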

Comments 8 pages

Details
Abstract

Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
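
GCC-PHAT, one of the spatial cues named above, estimates the delay between two microphone signals by whitening the cross-spectrum before transforming back to the lag domain. The sketch below is a toy version under simplifying assumptions: it uses a naive O(n²) DFT and treats the delay as a circular shift, whereas practical systems use an FFT with zero-padding. It is not the feature pipeline from the paper.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def gcc_phat_delay(x, y, eps=1e-12):
    """Circular lag d (in samples) such that y[t] is approximately x[t - d]."""
    X, Y = dft(x), dft(y)
    cross = [b * a.conjugate() for a, b in zip(X, Y)]
    phat = [c / (abs(c) + eps) for c in cross]      # keep phase, discard magnitude
    cc = [v.real for v in dft(phat, inverse=True)]  # PHAT-weighted cross-correlation
    n = len(cc)
    lag = max(range(n), key=lambda i: cc[i])
    return lag if lag <= n // 2 else lag - n        # map large lags to negative delays
```

With a microphone aperture of only a few centimetres, the true delays span very few samples, which is consistent with the abstract's point that classical spatial filtering loses resolving power in this regime.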

2602.08556 2026-05-18 cs.SD Updated version

Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

Chengzhong Wang, Andong Li, Dingding Yao, Junfeng Li

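Affiliations: Institute of Acoustics, Chinese Academy of Sciences; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: This paper proposes a magnitude-phase dual-stream framework whose phase stream is constrained to be globally rotation equivariant, improving phase modeling for speech enhancement; experiments show gains over existing methods on phase retrieval, denoising, dereverberation, and bandwidth extension.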

Comments Submitted to IEEE TASLP

Details
Abstract

While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it hard to model the underlying circular topology of the phase. To address this, we propose a magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing a Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual Feed-Forward Network (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/GRE-Net.
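
Global rotation equivariance means that rotating every input phase by the same angle rotates the output by that angle: f(e^{iθ} z) = e^{iθ} f(z). The toy layers below are not MPICM or HADF; they are hypothetical examples chosen to show one operation that satisfies the property (a bias-free complex-linear map with a modulus-only gate) and one that breaks it (ReLU applied separately to real and imaginary parts).

```python
import cmath
import math

def modulus_gated_layer(z, weights):
    """Complex-linear map (no bias) followed by a gate that depends only on |h|."""
    out = []
    for row in weights:
        h = sum(w * v for w, v in zip(row, z))
        out.append(h * math.tanh(abs(h)))  # the gate is rotation invariant
    return out

def split_relu_layer(z, weights):
    """Baseline: ReLU on real and imaginary parts separately (not equivariant)."""
    out = []
    for row in weights:
        h = sum(w * v for w, v in zip(row, z))
        out.append(complex(max(h.real, 0.0), max(h.imag, 0.0)))
    return out

def equivariance_error(layer, z, weights, theta):
    """Largest |layer(e^{i theta} z) - e^{i theta} layer(z)| over output channels."""
    rot = cmath.exp(1j * theta)
    lhs = layer([rot * v for v in z], weights)
    rhs = [rot * v for v in layer(z, weights)]
    return max(abs(a - b) for a, b in zip(lhs, rhs))
```

The error is zero up to floating-point precision for the modulus-gated layer and clearly nonzero for the split-ReLU layer, which illustrates why information exchange through the modulus is compatible with the constraint.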

2509.24550 2026-05-18 cs.LG cs.SD Updated version

Training-Free Multimodal Guidance for Video to Audio Generation

Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello

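Affiliations: Dept. of Information Engineering, Electronics, and Telecomm., Sapienza University of Rome, Italy

AI summary: This paper proposes a training-free multimodal guidance mechanism for video-to-audio diffusion that uses the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text, improving generation quality and multimodal alignment.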

Details
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract

Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.
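
The "volume spanned by the modality embeddings" can be computed from a Gram matrix. The sketch below assumes unit-normalized embeddings and the standard identity volume = sqrt(det(G)); how MDG turns this quantity into a guidance signal during diffusion sampling is not described in the abstract and is not shown here. The function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gram_volume(video, audio, text):
    """Volume of the parallelotope spanned by three unit-normalized embeddings."""
    vecs = [normalize(v) for v in (video, audio, text)]
    g = [[sum(a * b for a, b in zip(u, w)) for w in vecs] for u in vecs]
    # 3x3 determinant by cofactor expansion along the first row
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    return math.sqrt(max(det, 0.0))  # clamp tiny negative values from rounding
```

Mutually orthogonal embeddings give a volume of 1 and collinear embeddings give 0, so a smaller volume indicates that all three modalities agree at once, something a set of pairwise similarities does not measure directly.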

2509.20641 2026-05-18 cs.LG cs.SD Updated version

Investigating Modality Contribution in Audio LLMs for Music

Giovana Morais, Magdalena Fuentes

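Affiliations: Audio Research Lab, New York University, USA; Integrated Design & Media, New York University, USA

AI summary: This paper adapts the MM-SHAP framework to quantify each modality's contribution in Audio LLMs. The more accurate model relies more on text to answer questions, yet the models can still localize key sound events in the audio. It is the first application of MM-SHAP to Audio LLMs.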

Comments 5 pages, 2 figures, accepted at ICASSP 2026

Details
Abstract

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
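
The idea behind an MM-SHAP style score can be sketched with exact Shapley values on a handful of tokens. This is a toy version: `value` is a hypothetical stand-in for the model's output when only a subset of audio and text tokens is visible, and real applications approximate the Shapley values by sampling because exact enumeration grows exponentially with the number of tokens.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_tokens, value):
    """Exact Shapley value of each token for a set function value(frozenset) -> float."""
    phi = [0.0] * n_tokens
    for j in range(n_tokens):
        others = [i for i in range(n_tokens) if i != j]
        for r in range(len(others) + 1):
            weight = factorial(r) * factorial(n_tokens - r - 1) / factorial(n_tokens)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi

def modality_shares(phi, audio_idx):
    """Share of total absolute Shapley mass attributed to (audio, text) tokens."""
    total = sum(abs(p) for p in phi)
    audio = sum(abs(phi[j]) for j in audio_idx) / total
    return audio, 1.0 - audio
```

Because the shares use absolute values and sum to 1, the score describes how much each modality moves the prediction regardless of whether the prediction is correct, which is what "performance-agnostic" refers to.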

2507.15970 2026-05-18 cs.SD cs.AI eess.AS Updated version

CIS-BWE: Chaos-Informed Speech Bandwidth Extension

Tarikul Islam Tamiti, Tonmoy Das, Nursadul Mamun, Anomadarshi Barua

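Affiliations: Chittagong University of Engineering and Technology; George Mason University

AI summary: This paper proposes NDSI-BWE, a framework whose discriminators, four of them newly derived from nonlinear dynamical systems, capture the complex temporal behavior of speech; depth-wise convolutions reduce the parameter count, and the framework improves speech bandwidth extension.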

Details
Abstract

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial bandwidth extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. It also uses a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and a Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests with five human judges, NDSI-BWE establishes a new state of the art (SoTA) in BWE.
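
Two of the nonlinear-dynamics representations named above are simple to compute directly from a waveform. The sketch below shows a Poincaré plot and a recurrence matrix as plain functions; how the paper's discriminators consume such representations is not specified in the abstract, so this only illustrates the underlying quantities. Function names and the threshold `eps` are hypothetical.

```python
def poincare_points(x, lag=1):
    """Poincaré plot: each sample paired with the sample `lag` steps later."""
    return [(x[i], x[i + lag]) for i in range(len(x) - lag)]

def recurrence_matrix(x, eps):
    """R[i][j] = 1 when samples i and j are within eps of each other."""
    return [[1 if abs(a - b) <= eps else 0 for b in x] for a in x]

def recurrence_rate(x, eps):
    """Fraction of recurrent point pairs, a basic summary of the recurrence plot."""
    r = recurrence_matrix(x, eps)
    n = len(x)
    return sum(map(sum, r)) / (n * n)
```

For a strictly periodic signal the recurrence matrix shows regular diagonal stripes, while noise gives scattered points, so such structure offers a discriminator a view of temporal dynamics that spectral magnitudes alone do not provide.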

2506.12405 2026-05-18 cs.SD Updated version

Methods for pitch analysis in contemporary popular music: multiple pitches from harmonic tones in Vitalic's music

Emmanuel Deruty, David Meredith, Maarten Grachten, Pascal Arbez-Nicolas, Andreas Hasselholt Jørgensen, Oliver Søndermølle Hansen, Magnus Stensli, Christian Nørkær Petersen

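Affiliations: Sony Computer Science Laboratories, Paris, France; Department of Architecture, Design and Media Technology, Aalborg University, Aalborg, Denmark; Citizen Records, Dijon, France

AI summary: The study examines how a single harmonic complex tone can produce multiple perceived pitches in contemporary popular music. Using examples from Vitalic and other electronic artists, it analyzes the relationship between signal characteristics and pitch perception and finds significant variation across listeners in the perceived ambiguous pitches.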

Comments Pending review, Journal of the Audio Engineering Society

Details
Abstract

Aims. This study suggests that the use of multiple perceived pitches arising from a single harmonic complex tone is an active and intentional feature of contemporary popular music. The phenomenon is illustrated through examples drawn from the work of electronic artist Vitalic and others. Methods. Two listening tests were conducted: (1) evaluation of the number of simultaneous pitches perceived from single harmonic tones, and (2) manual pitch transcription of sequences of harmonic tones. Relationships between signal characteristics and pitch perception were then analyzed. Results. The synthetic harmonic tones found in the musical sequences under study were observed to transmit more perceived pitches than their acoustic counterparts, with significant variation across listeners. Multiple ambiguous pitches were associated with tone properties such as prominent upper partials and particular autocorrelation profiles. Conclusions. Harmonic tones in a context of contemporary popular music can, in general, convey several ambiguous pitches. The set of perceived pitches depends on both the listener and the listening conditions.
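
The link between autocorrelation profiles and candidate pitches can be illustrated with a synthetic tone. This sketch is not the analysis procedure used in the study: it assumes a plain (biased) autocorrelation and treats its local maxima as pitch candidates, and the function names are hypothetical.

```python
import math

def harmonic_tone(f0, partial_amps, sr, n):
    """Sum of cosine partials at integer multiples of f0.
    partial_amps[k] is the amplitude of partial k + 1."""
    return [sum(a * math.cos(2 * math.pi * (k + 1) * f0 * t / sr)
                for k, a in enumerate(partial_amps))
            for t in range(n)]

def autocorrelation(x, max_lag):
    """Biased autocorrelation for lags 0..max_lag."""
    return [sum(x[t] * x[t + lag] for t in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def pitch_candidates(x, sr, max_lag):
    """Local maxima of the autocorrelation, strongest first, as (frequency_hz, strength)."""
    r = autocorrelation(x, max_lag)
    peaks = [lag for lag in range(1, max_lag) if r[lag - 1] < r[lag] >= r[lag + 1]]
    peaks.sort(key=lambda lag: r[lag], reverse=True)
    return [(sr / lag, r[lag] / r[0]) for lag in peaks]
```

For example, `pitch_candidates(harmonic_tone(100, [1] * 5, 8000, 800), 8000, 200)` ranks the 100 Hz fundamental first and lists weaker peaks after it. Changing the partial amplitudes reshapes the profile and the relative strength of competing candidates, which is the kind of signal property the abstract associates with ambiguous pitch.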

2506.07073 2026-05-18 cs.SD cs.HC eess.AS Updated version

Insights on Harmonic Tones from a Generative Music Experiment

Emmanuel Deruty, Maarten Grachten

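Affiliations: Sony Computer Science Laboratories - Paris; Department of Architecture, Design and Media Technology; Aalborg University; Contractor for Sony CSL Paris

AI summary: Generative music AI ultimately serves music production. A studio-lab experiment shows that an AI model can generate structured simultaneous melodic lines from single harmonic complex tones, which reopens the question of whether humans perceive harmonics as distinct pitches and shows how generative AI can support both musical creativity and music understanding.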

Comments 15th International Workshop on Machine Learning and Music, September 9, 2024, Vilnius, Lithuania

Details
Abstract

The ultimate purpose of generative music AI is music production. The studio-lab, a social form within the art-science branch of cross-disciplinarity, is a way to advance music production with AI music models. During a studio-lab experiment involving researchers, music producers, and an AI music model that generates bass-like audio, it was observed that the producers used the model's output to convey two or more pitches with a single harmonic complex tone, which in turn revealed that the model had learned to generate structured and coherent simultaneous melodic lines using monophonic sequences of harmonic complex tones. These findings prompt a reconsideration of the long-standing debate on whether humans can perceive harmonics as distinct pitches and highlight how generative AI can not only enhance musical creativity but also contribute to a deeper understanding of music.

2605.15307 2026-05-18 cs.GR cs.CV cs.MM cs.SD Updated version

Sound Sparks Motion: Audio and Text Tuning for Video Editing

AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou

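AI summary: This paper proposes Sound Sparks Motion, a training-free framework that edits motion in video by tuning the multimodal conditioning signals of an audio-visual generation model at test time. It tunes an audio latent and a residual perturbation in the text conditioning, guided by feedback from a vision-language model.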

Comments Project Page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion

Details
Abstract

Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

2506.23552 2026-05-18 cs.CV cs.SD eess.AS Updated version

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

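AI summary: This paper proposes JAM-Flow, a unified framework that generates facial motion and speech together rather than treating talking head synthesis and speech synthesis as separate tasks. It combines flow matching with a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture whose selective joint attention layers enable cross-modal interaction while preserving modality-specific strengths. A single model supports conditioning inputs such as text, reference audio, and reference motion, enabling tasks such as synchronized talking head generation from text and audio-driven animation, and advancing multi-modal generative modeling.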

Comments project page: https://joonghyuk.com/jamflow-web Under review. Preprint published on arXiv

Details
Abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web
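
An inpainting-style flow matching objective can be sketched as follows. This assumes the widely used linear interpolation path between noise and data with a constant-velocity target; the abstract does not state which path or parameterization JAM-Flow uses, and the function names are hypothetical.

```python
import random

def flow_matching_example(x1, mask, rng):
    """Build one training example on the assumed linear path x_t = (1 - t) * x0 + t * x1.
    mask[i] == 1 marks positions to generate; positions with mask 0 stay clean as context."""
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]          # noise sample
    x_t = [(1 - t) * n + t * d if m else d for n, d, m in zip(x0, x1, mask)]
    target = [d - n for n, d in zip(x0, x1)]        # velocity; used only where mask == 1
    return t, x_t, target

def masked_mse(pred, target, mask):
    """Mean squared error over masked (to-be-generated) positions only."""
    terms = [(p - q) ** 2 for p, q, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms)
```

Choosing which positions are masked at training time is what lets a single model cover several tasks: masking all motion frames while keeping audio as context corresponds to audio-driven animation, and masking both streams corresponds to joint generation from text.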