arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28810 2026-05-28 cs.LG cs.IR cs.SD version update

Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister, Arsène Fansi Tchango, Guillaume Lajoie, Laurent Charlin

Affiliations: LUCID Inc., Toronto, Canada; LUCID Inc., Montréal, Canada; Mila - Québec AI Institute, Montréal, Canada

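AI summary: To address the ethical constraints on online affective experimentation, the paper presents AMRS, an affective music recommendation system built around a rollout-based world model. A causal transformer predicts users' affective states, and offline preference optimization improves recommendation quality.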

Abstract

Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID's health-and-wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer-wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout-based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self-reported valence and arousal. The world model serves both as an in-silico simulator for offline policy training and as a stress-testing tool before deployment. A recommender policy initialized by behaviour cloning is fine-tuned offline with Direct Preference Optimization (DPO) against a configurable multi-objective utility function. Under a strict cold-start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.
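The offline fine-tuning step described above can be illustrated with a minimal sketch. It assumes the standard DPO loss from the original DPO formulation and a linear utility over world-model predictions; the abstract does not give the utility's form, so the weights, the field names (`valence`, `arousal`, `engagement`), and the helper `make_pair` are hypothetical.

```python
import math

def utility(pred, weights):
    """Hypothetical linear multi-objective utility over world-model predictions."""
    return sum(weights[name] * pred[name] for name in weights)

def make_pair(cand_a, cand_b, weights):
    """Order two candidate rollouts as (chosen, rejected) by predicted utility."""
    ua = utility(cand_a["pred"], weights)
    ub = utility(cand_b["pred"], weights)
    return (cand_a, cand_b) if ua >= ub else (cand_b, cand_a)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    if margin < -30.0:
        # For very negative margins, -log(sigmoid(m)) is approximately -m;
        # this branch also avoids overflow in exp().
        return -margin
    return math.log1p(math.exp(-margin))

# Illustrative values only; none of these numbers come from the paper.
weights = {"valence": 1.0, "arousal": 0.5, "engagement": 0.5}
track_a = {"id": "a", "pred": {"valence": 0.2, "arousal": 0.1, "engagement": 0.9}}
track_b = {"id": "b", "pred": {"valence": 0.6, "arousal": 0.4, "engagement": 0.5}}
chosen, rejected = make_pair(track_a, track_b, weights)
```

When the policy and the cloned reference assign identical log-probabilities, the loss equals log 2; it falls as the policy shifts probability toward the higher-utility rollout relative to the reference.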

2605.28687 2026-05-28 cs.SD physics.med-ph version update

Cross-modal characterization of infant cry: validation of a chest-surface accelerometer in extracting acoustic vocal function measures

Winko W. An, Saketh Sundar, Lisa Yankowitz, Daryush D. Mehta, Carol L. Wilkinson

Affiliations: Division of Developmental Medicine, Boston Children’s Hospital; Harvard Medical School; Harvard University; Children’s Hospital of Philadelphia; Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital

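AI summary: This study validates a chest-surface accelerometer for infant cry analysis, finding that it reliably captures acoustic features such as fundamental frequency and jitter, and offers a noise-robust, privacy-preserving alternative for clinical research.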

Abstract

Background: Infant cry acoustics provide a promising window into early neurodevelopment and may serve as scalable biomarkers for neurodevelopmental disorders. However, conventional microphone-based recordings are highly susceptible to environmental noise and raise privacy concerns in real-world clinical settings. Chest-surface accelerometers may offer a robust alternative by capturing vibrations directly from the larynx. Methods: We evaluated the validity of a chest-mounted accelerometer (ACC) for infant cry analysis by comparing acoustic features derived from ACC and simultaneously recorded microphone (MIC) signals during routine vaccination visits. The final sample included 85 infants (41 at 4 months; 44 at 12 months) from a diverse pediatric population. Seven vocal measures were extracted from both modalities, including fundamental frequency (F0), jitter, shimmer, cepstral peak prominence (CPP), and harmonics-to-noise ratio (HNR). Agreement and consistency between modalities were assessed using intraclass correlation coefficients (ICCs). Results: F0 demonstrated excellent agreement between ACC and MIC recordings (ICC > 0.94). Jitter measures also showed good-to-excellent agreement, while CPP demonstrated moderate agreement. Shimmer and HNR showed lower absolute agreement and systematic bias between modalities, reflecting possible differences in signal transmission and noise sensitivity. Conclusion: In summary, chest-surface accelerometers can reliably capture several clinically relevant acoustic features of infant cry, particularly temporal measures of F0 and jitter. This approach offers a noise-robust and privacy-preserving alternative to microphone-based recordings, supporting its potential use in scalable clinical and developmental research applications.
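The distinction between agreement and consistency can be made concrete with a small sketch. The abstract does not say which ICC forms were used, so this assumes the two-way, single-measurement forms of McGraw and Wong: ICC(A,1) for absolute agreement and ICC(C,1) for consistency. The function name `icc_two_way` and the sample values are hypothetical.

```python
def icc_two_way(rows):
    """Return (absolute agreement ICC(A,1), consistency ICC(C,1)).

    rows: one list per subject, one value per measurement method,
    e.g. [[f0_acc, f0_mic], ...]. Assumes the subjects are not all identical.
    """
    n, k = len(rows), len(rows[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 methods")
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    # Mean squares of the two-way ANOVA decomposition.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sse = sum((rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return agreement, consistency

# Made-up F0 values (Hz): the second method reads a constant 10 Hz higher.
rows = [[300.0, 310.0], [350.0, 360.0], [400.0, 410.0], [450.0, 460.0]]
agreement, consistency = icc_two_way(rows)
```

With a constant offset between methods, consistency stays at 1 while absolute agreement drops below it, which is the pattern a systematic bias between modalities would produce.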

2605.28657 2026-05-28 cs.SD version update

DEMON: Diffusion Engine for Musical Orchestrated Noise

Ryan Fosdick

Affiliations: GitHub

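AI summary: Proposes DEMON, a real-time diffusion engine that makes the denoising process playable as a live instrument through four mechanisms: heterogeneous denoise scheduling, shared mutable state, per-frame source blending, and windowed VAE decode. It reaches 12.3 decoder completions per second for 60-second music on a single GPU.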

Comments 15 pages, 3 figures, 15 tables. Project page with audio samples and demo video: https://daydreamlive.github.io/DEMON/

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
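The idea behind mechanism (1) can be sketched as a toy ring buffer in which every slot carries its own timestep schedule. This is not DEMON's implementation: the linear schedule in `make_schedule`, the one-admission-per-tick policy, and the class names `Slot` and `Ring` are assumptions made purely for illustration.

```python
def make_schedule(denoise, steps):
    """Hypothetical schedule: `steps` timesteps descending linearly from `denoise`."""
    return [denoise * (steps - i) / steps for i in range(steps)]

class Slot:
    """One in-flight request that owns its timestep schedule."""
    def __init__(self, request_id, schedule):
        self.request_id = request_id
        self.schedule = schedule
        self.pos = 0

    def step(self):
        timestep = self.schedule[self.pos]  # a real engine would denoise at this timestep
        self.pos += 1
        return timestep

    @property
    def done(self):
        return self.pos >= len(self.schedule)

class Ring:
    """Toy ring buffer: each tick admits one request and advances every slot one step."""
    def __init__(self, depth, steps):
        self.depth, self.steps = depth, steps
        self.slots = []
        self.next_id = 0

    def tick(self, denoise):
        # The new request takes the current slider value; in-flight slots keep
        # the schedules they were admitted with, so nothing is wiped.
        if len(self.slots) < self.depth:
            self.slots.append(Slot(self.next_id, make_schedule(denoise, self.steps)))
            self.next_id += 1
        for slot in self.slots:
            slot.step()
        finished = [s.request_id for s in self.slots if s.done]
        self.slots = [s for s in self.slots if not s.done]
        return finished

ring = Ring(depth=4, steps=4)
completions = [ring.tick(d) for d in (1.0, 0.8, 0.6, 0.4, 0.2)]
```

After the warm-up ticks, one request completes per tick even though the slider value changed on every tick, and each slot still in flight retains the schedule built from the slider value at its admission.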

2605.28480 2026-05-28 eess.AS cs.SD version update

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Yucheng Wang, Jing Peng, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang

Affiliations: School of Intelligence Science and Technology, Nanjing University, China; Department of Computer Science, ETH Zürich, Switzerland; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; School of Automation Science and Engineering, Xi’an Jiaotong University, China; School of Computer Science, Northwestern Polytechnical University, China

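AI summary: Proposes the Audio-Mind framework, which uses conditional evidence acquisition to dynamically combine a strong frontend with planner-guided tool use, addressing the question of when agentic evidence acquisition should happen in audio understanding. It reaches 80.4% accuracy on MMAR and 82.8% on MSU-Bench and produces auditable reasoning traces.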

Abstract

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.

2605.24678 2026-05-28 cs.AI cs.CL cs.SD version update

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

Vassilis Lyberatos, Edmund G. Dervakos, Eleni Adamidi, Athanasios Voulodimos, Giorgos Stamou

Affiliations: National Technical University of Athens; PsychNow

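AI summary: Proposes a systematic analysis framework based on perceptual acoustic and linguistic features (prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm). Combining statistical analysis with interpretable machine learning (XGBoost with SHAP and LIME), it finds stable associations between speech features and the symptom severity of depression, anxiety, and ADHD across multiple datasets, and uses an ablation study to identify the most informative feature groups.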

Comments Accepted to CLPsych 2026, part of ACL 2026

Abstract

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.
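The vocal irregularity measures named above have simple textbook forms. The sketch below assumes the common "local" definitions: the mean absolute difference between consecutive glottal periods (jitter) or peak amplitudes (shimmer), divided by the mean value. The abstract does not state which variants the authors extract, and the function names and sample values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

def local_jitter(periods_s):
    """Local jitter from consecutive glottal cycle durations, in seconds."""
    return local_perturbation(periods_s)

def local_shimmer(peak_amplitudes):
    """Local shimmer from consecutive per-cycle peak amplitudes."""
    return local_perturbation(peak_amplitudes)

# Made-up cycle durations alternating between 10 ms and 11 ms.
jitter = local_jitter([0.010, 0.011, 0.010, 0.011])
```

A perfectly periodic voice yields zero for both measures; larger cycle-to-cycle variation in period or amplitude raises them.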

2605.28101 2026-05-28 cs.SD cs.AI cs.MM version update

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu

Affiliations: The Chinese University of Hong Kong, Shenzhen; University of Pennsylvania

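AI summary: Proposes the EIGENET framework, which combines a cross-view alternate-attention Transformer, a geometry-informed modulation block, and multi-task learning for few-shot novel-view room impulse response prediction, reaching state-of-the-art performance.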

Comments Code available on https://github.com/FEAfeatherTHER/EigeNet

Abstract

Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. In addition, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.

2605.28063 2026-05-28 cs.SD cs.AI cs.MM version update

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

Affiliations: Renmin University of China

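AI summary: Proposes the PlanAudio framework, which uses LLM reasoning capability and a semantic latent chain-of-thought mechanism to generate unified audio containing speech and sound directly from free-form text.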

Abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

2605.28035 2026-05-28 cs.AI cs.MM cs.SD version update

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

Affiliations: Shanghai University; Beijing Institute of Technology; Shanghai Film Academy; Tsinghua University; Hefei University of Technology; Inkeverse Group Limited; The University of Adelaide; Beijing University of Technology; Beijing Academy of Artificial Intelligence; OpenNLP Lab

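AI summary: To address the lack of evaluation of cinematic expressiveness in multi-talker audio-video generation, the paper proposes the MTAVG-Bench 2.0 benchmark. It builds a high-level failure taxonomy covering acting, narrative, atmosphere, and audio-visual language, with more than 10,000 question-answering instances, to systematically evaluate how well omni large language models diagnose complex audio-visual failures.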

Abstract

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2605.27976 2026-05-28 cs.SD version update

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

Jashin Ye, Dongxiao Wang, Yixuan Ye, Sashuai Zhou, Weihuang Lin, Mingyang Han, Kunpeng Wang, Zeyu Yuan, Boyu Li, Haoxiang Shi, Jingchen Shu, Jun Song, Bo Zheng

Affiliations: Future Living Lab; Alibaba

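AI summary: Proposes the VoiceGiraffe benchmark, which uses 1,500 triplets and a dual-level taxonomy to evaluate the single-hop perception and multi-hop reasoning of large audio language models in long-context settings, revealing a bottleneck in long-range memory persistence.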

Comments Benchmark Project: https://github.com/LivingFutureLab/VoiceGiraffe

Abstract

While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. End-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.

2605.27944 2026-05-28 cs.AI cs.MM cs.SD version update

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

Ke Liu, Jiwei Wei, Wenyu Zhang, Shuchang Zhou, Ruikun Chai, Yutao Dai, Chaoning Zhang, Yang Yang

Affiliations: Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China

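AI summary: To address the performance drop of existing audio-visual deepfake detectors in singing scenarios, the paper proposes a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework. Using facial authenticity pattern learning and multi-modal differential weight learning, it achieves robust detection in both talking and singing scenarios.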

Comments Accepted by ICML 2026

Abstract

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

2605.27840 2026-05-28 eess.AS cs.AI cs.SD version update

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu

Affiliations: Shenzhen International Graduate School, Tsinghua University, China; ModelBest Inc., China

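AI summary: Proposes LoSATok, a low-dimensional audio tokenizer that uses semantic bottleneck compression and dual-level semantic supervision to jointly capture semantics and acoustic details in a compact latent space, improving the generation performance of Diffusion Transformers.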

Abstract

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.

2605.27838 2026-05-28 cs.SD version update

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

Jiahao Mei, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Gang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan, Mengyue Wu

Affiliations: X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China; MiLM Plus, Xiaomi Inc., Beijing, China

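AI summary: Proposes Dasheng AudioGen, a unified framework that uses structured multi-view captions and a high-dimensional unified semantic-acoustic representation for end-to-end generation of mixed-audio scenes from text.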

Abstract

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.

2605.27772 2026-05-28 cs.SD cs.LG version update

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani

Affiliations: Institute for Creative Technologies, University of Southern California, Los Angeles, USA

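AI summary: To address the weak paralinguistic understanding of audio LLMs, the paper proposes the adversarial benchmark VoxParadox and the Prompt-Conditioned Layer Mixer method, substantially improving the models' use of paralinguistic cues.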

Comments Accepted as a conference paper at ICML 2026. Project page: https://voxparadox.github.io/

Abstract

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
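One generic way to combine encoder layers conditioned on the prompt is a softmax gate over layers. The sketch below is not PCLM's architecture, which the abstract does not specify: the linear gate, the shapes, and the names `mix_layers` and `gate_weights` are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_feats, prompt_vec, gate_weights):
    """Blend per-layer audio features with prompt-dependent weights.

    layer_feats:  L lists of D floats (one feature vector per encoder layer)
    prompt_vec:   P floats (a pooled prompt embedding)
    gate_weights: L lists of P floats (hypothetical learned linear gate)
    """
    logits = [sum(w * p for w, p in zip(row, prompt_vec)) for row in gate_weights]
    alphas = softmax(logits)
    dim = len(layer_feats[0])
    mixed = [sum(a * feats[d] for a, feats in zip(alphas, layer_feats))
             for d in range(dim)]
    return mixed, alphas

# A zero gate gives uniform weights, so the output is the plain layer average.
mixed, alphas = mix_layers([[1.0, 0.0], [3.0, 2.0]], [1.0], [[0.0], [0.0]])
```

A gate of this kind lets a prompt about speaking style weight shallower layers more heavily, which is one plausible response to the finding that paralinguistic cues degrade in deeper encoder layers.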

2605.27258 2026-05-28 cs.SD cs.AI Updated version

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

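Affiliations Amap, Alibaba Group; The Chinese University of Hong Kong, Shenzhen

AI summary Proposes PilotTTS, a lightweight autoregressive TTS system that reaches competitive performance through a minimalist architecture and rigorous data engineering (only 200K hours of data processed with open-source tools). It supports zero-shot voice cloning and emotion, paralinguistic, and dialect synthesis, and achieves the lowest WER and the highest speaker similarity on the Seed-TTS Eval benchmark.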

English abstract

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.

2605.16578 2026-05-28 cs.SD cs.AI cs.HC cs.LG Updated version

Voice "Cloning" is Style Transfer

Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

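Affiliations Cornell University; TogetherAI; Stanford University

AI summary Finds that voice cloning does not faithfully reproduce the source voice but systematically applies style transfer, making cloned voices more authoritative, warm, customer-service-like, and human-like, which homogenizes speaker characteristics and affects human trust and behavior.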

English abstract

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

2605.13931 2026-05-28 eess.AS cs.SD Updated version

FSD50K-Solo: Automated Curation of Single-Source Sound Events

Ningyuan Yang, Sile Yin, Li-Chia Yang, Bryce Irvin, Xiao Quan, Marko Stamenovic, Shuo Zhang

Affiliations Electrical & Computer Engineering, Stony Brook University, Stony Brook, NY, USA; Research, Bose Corporation, Framingham, MA, USA

AI summary Proposes a data curation framework based on a generative diffusion model and a pre-trained audio encoder that automatically identifies and filters out multi-source samples, building the single-source sound event dataset FSD50K-Solo from FSD50K.

Comments Accepted to EUSIPCO 2026. 5 pages, 3 figures

English abstract

High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable fraction of multi-source samples where background interference or overlapping events could limit the usefulness of the data. To address this challenge, we introduce a data curation framework designed for large-scale open audio corpora. Our approach leverages a generative diffusion model to synthesize clean single-class events to construct controlled noisy mixtures for supervision. We subsequently employ a pre-trained audio encoder coupled with a discriminative classifier to automatically identify and filter out multi-source samples. Experiments show that our framework achieves strong performance on a human expert-curated test set. Finally, we release FSD50K-Solo, a model-curated subset of FSD50K containing single-source audio samples identified by our method. Beyond FSD50K, our method establishes a scalable paradigm for curating open source audio corpora.

2601.04876 2026-05-28 cs.SD Updated version

ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Jialiang Tao, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, Qingsong Wen

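Affiliations Nanyang Technological University; Independent Researcher; United Arab Emirates University; North China Electric Power University; Southwest Jiaotong University; Squirrel AI Learning

AI summary Proposes ChronosAudio, the first multi-task benchmark for long-audio understanding in audio large language models, with 6 major task categories and 36,000 test instances. Experiments reveal long-context collapse and attention dilution, and existing mitigation strategies recover only 50% of performance.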

English abstract

Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long-audio understanding capabilities remain unexplored. While a plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving no consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short, middle, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: 1. Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a staggering performance degradation of over 90% in specific tasks. 2. Structural Attention Dilution: Performance degradation stems from a fundamental failure in maintaining temporal locality; attention mechanisms suffer from significant diffusion in later sequences. 3. Restorative Ceiling of Mitigation: Current strategies only offer 50% recovery. These findings reveal significant challenges in long-audio understanding, underscoring the urgent need for approaches to achieve robust, document-level audio reasoning.

2512.09786 2026-05-28 cs.LG cs.PF cs.SD eess.AS eess.SP Updated version

TinyDéjàVu: Smaller RAM and Faster Inference with Neural Networks on MCUs for Sensor Data Streams

Zhaolan Huang, Emmanuel Baccelli

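Affiliations Inria Saclay, France; Inria, France

AI summary Proposes the TinyDéjàVu framework, which optimizes data flows during neural network inference to save up to 90% of RAM at equal compute latency on microcontrollers, for inference on sensor data time series.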

English abstract

Examples of embedded intelligence include a wide variety of tiny neural networks used on-board wireless sensors and actuators, which are expected to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware is exclusively based on microcontrollers with as little memory as possible, e.g., 128 kB of RAM. In this context, optimizing data flows during inference across neural network layers becomes crucial. In this paper, we introduce a new framework, TinyDéjàVu, and novel algorithms we designed to drastically reduce the RAM budget required by inference using various neural network models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on common microcontroller hardware (Arm Cortex-M). We show that TinyDéjàVu can save up to 90% of RAM usage with equal compute latency compared to prior work (StreamiNNC) on overlapping sliding window inputs.
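
To make the setting of overlapping sliding window inputs concrete, here is a generic sketch of a fixed-size buffer that stores per-sample features once and reuses them across overlapping windows. It is not the TinyDéjàVu algorithm, whose details the abstract does not give: the class `StreamingWindow`, the squaring "layer", and the window and hop sizes are all assumptions chosen only to show why overlap leaves room for reuse under a bounded memory budget.

```python
from collections import deque

class StreamingWindow:
    """Hypothetical helper: keeps features of the last `window` samples only."""

    def __init__(self, window, hop, feature_fn):
        self.window = window
        self.hop = hop
        self.feature_fn = feature_fn
        self.buffer = deque(maxlen=window)  # bounded memory: oldest entry is dropped
        self.since_last = 0                 # samples seen since the last output
        self.feature_calls = 0              # counts how often feature_fn runs

    def push(self, sample):
        """Add one sample; return the window mean when a new window is due."""
        self.buffer.append(self.feature_fn(sample))
        self.feature_calls += 1
        self.since_last += 1
        if len(self.buffer) == self.window and self.since_last >= self.hop:
            self.since_last = 0
            return sum(self.buffer) / self.window
        return None

stream = StreamingWindow(window=4, hop=2, feature_fn=lambda x: x * x)
outputs = []
for sample in range(10):
    result = stream.push(sample)
    if result is not None:
        outputs.append(result)
print(outputs, stream.feature_calls)
```

With a window of 4 and a hop of 2, recomputing every window from scratch would evaluate the feature function 16 times for the four windows, while the buffered version evaluates it once per sample.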

2511.05550 2026-05-28 cs.SD cs.CL cs.LG Updated version

Assessing Factual Music Comprehension in Large Audio Language Models

Daniel Chenyu Lin, Michael Freeman, John Thickstun

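AI summary To address the fact that the existing MusicQA dataset cannot measure the factual correctness of model answers, proposes an evaluation protocol based on verifiable information that assesses models objectively with precision, recall, and F1 scores, defines six factual retrieval tasks on three datasets, and benchmarks nine recent LALMs.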

Comments 16 pages; second submission

English abstract

Large audio language models (LALMs) leverage multimodal representations to generate open-ended answers to natural language queries about audio. In this paper, we (1) provide empirical evidence that assessment of LALMs using the popular MusicQA dataset fails to measure whether a model's responses about music are factually correct, and (2) develop a new protocol for assessing the music comprehension capabilities of LALMs. Specifically, we propose an evaluation protocol that prompts a LALM for factually verifiable information, and parses its open-ended response into a structured format that can be objectively assessed using Precision, Recall, and F1 scores. Using this protocol, we define a benchmark consisting of six factual information retrieval tasks defined on three diverse datasets: MusicNet, the Free Music Archive, and OverClocked ReMix. We benchmark nine recent LALMs, including frontier models like Gemini and the latest open models like Music Flamingo, and release the suite of evaluation scripts at https://github.com/DCL2004/LALM-Eval to facilitate benchmarking of new LALMs.
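
The scoring step of such a protocol can be sketched as follows, assuming the model's open-ended answer has already been parsed into a list of items. This is an illustration and not the released LALM-Eval scripts: the function name `precision_recall_f1`, the case-insensitive set matching, and the instrument lists are assumptions.

```python
def precision_recall_f1(predicted, reference):
    """Set-based Precision, Recall, and F1 over parsed answer items.

    Items are compared after trimming whitespace and lowercasing, which is
    an assumed matching rule for this sketch.
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical parsed answer versus hypothetical ground truth for one recording.
parsed_answer = ["Piano", "Violin", "Flute"]
ground_truth = ["piano", "violin", "cello"]
print(precision_recall_f1(parsed_answer, ground_truth))
```

Two of the three predicted instruments are correct and two of the three reference instruments are recovered, so precision, recall, and F1 all come out at two thirds in this example.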

2506.08846 2026-05-28 cs.CY cs.CL cs.SD eess.AS Updated version

Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke

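Affiliations University of Washington; Cornell University; New York University; University of Virginia

AI summary Identifies three common pitfalls in standard ASR audits and proposes a holistic auditing framework; a case study finds that ASR systems perform worse for people with aphasia.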

Comments Published at the Proceedings of The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

English abstract

Automatic Speech Recognition (ASR) systems' growing use warrants robust auditing approaches to ensure equitable transcription quality, especially for people with speech disorders like aphasia who disproportionately depend on ASR. While academic and industry audits have revealed performance disparities across user populations, standard auditing practices often overlook nuances that risk masking harm to marginalized groups. We identify three common pitfalls in standard ASR audits: (1) adhering to one method of text standardization, which can mask variance in ASR performance and ignore the standardization preferences of marginalized communities; (2) displaying high-level demographic findings without considering performance disparities by nuanced intersectional subgroups, or conditioning on relevant acoustic properties; and (3) reporting only one gold-standard metric (Word Error Rate), which inadequately quantifies common generative AI errors like hallucinations. We propose a holistic auditing framework addressing these pitfalls, and in a case study of six popular ASR systems, find consistently worse ASR performance for speakers with aphasia relative to a control group. We call on practitioners to implement these robust, community-driven ASR auditing practices better suited for the rapidly changing ASR landscape.
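
The first pitfall, dependence on a single text standardization method, can be illustrated with a small Word Error Rate sketch. It is not the authors' audit code: the functions `normalize` and `wer`, the filler-word list, and the example sentences are assumptions. The same transcript pair is scored under two standardization choices to show that the reported WER changes with the choice.

```python
import re

def normalize(text, strip_fillers=False):
    """Lowercase, drop punctuation, and optionally drop filler words (assumed list)."""
    words = re.sub(r"[^\w\s']", " ", text.lower()).split()
    if strip_fillers:
        words = [w for w in words if w not in {"um", "uh", "er"}]
    return words

def wer(reference_words, hypothesis_words):
    """Word Error Rate: word-level edit distance divided by reference length."""
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i] + [0] * len(hypothesis_words)
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        previous = current
    return previous[-1] / len(reference_words)

reference = "I, um, want the red book."   # hypothetical human transcript
hypothesis = "i want a red book"          # hypothetical ASR output
print(wer(normalize(reference), normalize(hypothesis)))
print(wer(normalize(reference, strip_fillers=True),
          normalize(hypothesis, strip_fillers=True)))
```

Keeping the filler word counts its omission as an error (2 errors over 6 reference words), while stripping fillers does not (1 error over 5 words), so the standardization choice alone changes the score for the same audio.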

2505.17233 2026-05-28 cs.LG cs.SD eess.AS Updated version

Semantic-Aware Interpretable Multimodal Music Auto-Tagging

Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

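Affiliations National Technical University of Athens

AI summary Proposes an interpretable music auto-tagging method that uses groups of multimodal music features and an expectation maximization algorithm, offering transparency into the decision process while maintaining competitive performance.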

Comments Accepted at Interspeech 2025

English abstract

Music auto-tagging is essential for organizing and discovering music in extensive digital libraries. While foundation models achieve exceptional performance in this domain, their outputs often lack interpretability, limiting trust and usability for researchers and end-users alike. In this work, we present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features, derived from signal processing, deep learning, ontology engineering, and natural language processing. To enhance interpretability, we cluster features semantically and employ an expectation maximization algorithm, assigning distinct weights to each group based on its contribution to the tagging process. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process, paving the way for more transparent and user-centric music tagging systems.
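
One way to picture weighting feature groups with expectation maximization is the textbook EM update for mixture weights over fixed components. The abstract does not specify the paper's formulation, so this is an assumed stand-in: the function `em_group_weights` and the likelihood table are hypothetical, with each row giving the probability that each feature group assigns to the correct tag for one track.

```python
def em_group_weights(likelihoods, iterations=50):
    """Textbook EM for mixture weights with fixed per-group likelihoods.

    likelihoods[n][g] is the (strictly positive) probability that group g
    assigns to the true tag of example n. Returns one weight per group.
    """
    n_groups = len(likelihoods[0])
    weights = [1.0 / n_groups] * n_groups
    for _ in range(iterations):
        totals = [0.0] * n_groups
        for row in likelihoods:
            mixture = sum(w * p for w, p in zip(weights, row))
            # E-step: responsibility of each group for this example.
            for g, (w, p) in enumerate(zip(weights, row)):
                totals[g] += w * p / mixture
        # M-step: new weight is the average responsibility.
        weights = [t / len(likelihoods) for t in totals]
    return weights

# Hypothetical likelihoods for three feature groups on three tracks.
table = [
    [0.9, 0.2, 0.5],
    [0.8, 0.3, 0.5],
    [0.7, 0.1, 0.4],
]
print(em_group_weights(table))
```

The resulting weights sum to one and rank the groups by how well they explain the correct tags, which is the kind of per-group contribution an interpretable tagger can report.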