arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Speech Translation and Speech Language Models (1 paper)

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS cross-listed

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

Affiliations KAIST (Korea Advanced Institute of Science and Technology)

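AI summary Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space via expert distillation so that audio language models can draw on rich acoustic features before generating a response, with no additional autoregressive decoding cost, improving performance on multiple audio tasks.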

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.
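
As a rough illustration of how an auxiliary distillation signal at the thinking positions could be combined with the usual text loss, the sketch below adds a mean-squared-error term between thinking-position states and frozen expert features. The abstract does not specify the loss, so the MSE form, the weight `aux_weight`, and every function and argument name here are assumptions for illustration only, not CoAT's actual objective.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(text_loss, thinking_states, expert_features, aux_weight=0.5):
    """Text loss plus a weighted distillation term averaged over thinking positions.

    thinking_states and expert_features hold one vector per thinking position.
    Both names are hypothetical; the expert features stand in for the output
    of a frozen audio expert.
    """
    if len(thinking_states) != len(expert_features):
        raise ValueError("need one expert vector per thinking position")
    if not thinking_states:
        # No thinking positions: only the text loss remains.
        return text_loss
    per_position = [mse(h, e) for h, e in zip(thinking_states, expert_features)]
    distill = sum(per_position) / len(per_position)
    return text_loss + aux_weight * distill
```

When the thinking states already match the expert features, the auxiliary term vanishes and only the text loss is left, which is the sense in which the distillation is "auxiliary" in this sketch.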

2. Multimodal Audio and Audio-Visual Learning (1 paper)

2606.19341 2026-06-18 cs.CV cs.CL cs.SD cross-listed

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Affiliations The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

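AI summary Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle; active perception decouples reasoning complexity from video duration, and the agent achieves state-of-the-art performance among open-source models on multiple benchmarks.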

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

3. Other / General Speech and Audio (5 papers)

2606.18266 2026-06-18 cs.HC cs.AI cs.SD cross-listed

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

Nelly Garcia, Ruby Crocker, Bleiz M Del Sette, Fabrizio Smeraldi, Charalampos Saitis, George Fazekas, Joshua Reiss

Affiliations Queen Mary University of London

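AI summary Investigates how film audio design affects audience emotion and immersion by manipulating three audio parameters (frequency, dynamics, and directionality); finds that subtle changes can alter emotional perception and that unconventional mixes increase variability in interpretation.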

Comments AES Europe 2026

Abstract

EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in a cinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

2606.18480 2026-06-18 eess.AS cs.SD cross-listed

Generalised Transcoding Framework for Arbitrary Spatial Audio Capture and Playback Formats

Archontis Politis, Janani Fernandez, Leo McCormack

Affiliations Faculty of Information Technology and Communication Sciences, Tampere University; Department of Information and Communications Engineering, Aalto University

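AI summary Proposes a unified framework that estimates time-frequency-domain spatial metadata (primary source components plus an ambience component with its own angular power distribution) to transcode Ambisonic or raw microphone array signals to arbitrary target playback formats, with support for independent rotations; experiments indicate advantages over existing parametric renderers.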

Comments This work has been submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication

Abstract

This article introduces a unified framework for the parametric analysis and reproduction of spatial sound scenes captured either as Ambisonic signals or as raw microphone array signals. The proposed method estimates time-frequency-dependent spatial metadata that characterises a variable number of primary source components and an ambience component with its own angular power distribution, whose parameters fit the observed spatial covariances of the captured signals. This metadata is used to construct spatial covariances of the target playback formats, which are then used to derive optimal mixing matrices for transcoding the scene for playback over the target reproduction system. The method additionally handles independent rotations of both capture and playback setups. Real-time implementations of the method and other existing state-of-the-art parametric renderers are compared in a listening test using simulated scenes from Ambisonic, spherical, and head-worn arrays. The results highlight perceptual benefits of the proposed framework across a diverse range of content and receiver configurations, particularly for lower-order and geometrically constrained microphone arrays.

2606.18571 2026-06-18 cs.LG cs.CL cs.SD eess.AS cross-listed

Fair Cognitive Impairment Detection Through Unlearning

William Nguyen, Jiali Cheng, Hadi Amiri

Affiliations University of Massachusetts Lowell, USA

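AI summary Proposes a multimodal framework combining cross-modal fusion with gradient-reversal unlearning to reduce the bias that demographic information introduces into mild cognitive impairment detection, narrowing performance gaps on cross-lingual datasets.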

Comments Interspeech 2026

Abstract

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.
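
Gradient reversal, the mechanism named in (ii), can be sketched without any deep-learning library: the layer is the identity in the forward pass and flips (and scales) the gradient in the backward pass, so the shared embedding is pushed to make demographic prediction harder. The class name, the scale `lam`, and the manual forward/backward interface below are illustrative assumptions, not the authors' implementation.

```python
class GradientReversal:
    """Identity on the forward pass, negated and scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, features):
        # Features pass through unchanged to the demographic classifier.
        return list(features)

    def backward(self, grad_from_demographic_head):
        # The demographic head's gradient is flipped before reaching the encoder.
        return [-self.lam * g for g in grad_from_demographic_head]


def encoder_gradient(task_grad, demographic_grad, layer):
    """Gradient reaching the shared embedding: task gradient plus reversed demographic gradient."""
    reversed_grad = layer.backward(demographic_grad)
    return [t + r for t, r in zip(task_grad, reversed_grad)]
```

In a real training setup the same behaviour would normally be expressed through the autograd mechanism of the chosen framework; the manual version above only makes the sign flip explicit.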

2606.18979 2026-06-18 eess.AS cs.CL cs.SD cross-listed

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

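AI summary Fuses transcript-derived scores with Whisper embeddings to reduce scoring errors in speech-based assessment, and uses the fused representations to approximate expert overall ratings, compensating for the missing motor subtests while effectively discriminating between cognitive status groups.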

Comments Accepted at INTERSPEECH 2026

Abstract

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

2606.19039 2026-06-18 cs.NE cs.LG cs.SD cross-listed

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

Taharim Rahman Anon, Jakaria Islam Emon

Affiliations PI LLC

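AI summary Proposes a learnable residual speech-to-spike encoder trained jointly with an R-LIF backbone, reaching 94.97% accuracy on GSC-v2 while remaining parameter-efficient and learning task-aligned spike representations.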

Comments Accepted at Interspeech 2026. This version is a preprint

Abstract

The mismatch between continuous acoustic signals and discrete event-driven processing remains a fundamental bottleneck for neuromorphic speech processing. Current systems typically rely on fixed spike encoders, forcing downstream Spiking Neural Networks (SNNs) to compensate for non-adaptive input representations. To address this, we present a learnable residual speech-to-spike encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. We validate this approach on the Google Speech Commands v2 (GSC-v2) benchmark, achieving up to 94.97% accuracy. Notably, the learned encoder remains highly parameter-efficient with a compact 35k-parameter variant that reaches 89.8%, matching or exceeding prior baselines that require an order of magnitude more parameters. Our encoder-focused analysis, including linear probing and gradient-residual inspection, indicates that the encoder does not target faithful signal reconstruction but instead learns task-aligned spike representations that enhance class separability. Finally, we benchmark bio-inspired, hardware-friendly credit assignment by comparing Direct Feedback Alignment (DFA) with surrogate-gradient BPTT under identical architectures and training conditions. We find that DFA reaches 91.5% accuracy, quantifying the performance trade-off of bio-inspired learning rules for modern neuromorphic audio.
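
For readers unfamiliar with spike encoding, here is a minimal fixed leaky integrate-and-fire (LIF) encoder of the kind the abstract contrasts with its learnable one. It is only a sketch: the decay `beta`, the threshold, and the soft reset are assumed choices, and nothing here reproduces the paper's residual encoder or R-LIF backbone.

```python
def lif_encode(signal, beta=0.9, threshold=1.0):
    """Turn a sequence of real-valued samples into a binary spike train.

    beta      -- membrane decay per step (assumed value)
    threshold -- firing threshold (assumed value)
    """
    membrane = 0.0
    spikes = []
    for sample in signal:
        # Leak the previous potential, then integrate the new input.
        membrane = beta * membrane + sample
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # soft reset: subtract the threshold
        else:
            spikes.append(0)
    return spikes
```

With no leak (`beta=1.0`) a constant input of 0.5 crosses the threshold on every second step, while a stronger leak can suppress spikes entirely for the same input energy. This fixed behaviour is what a downstream network would otherwise have to compensate for.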