arXivDaily: daily arXiv digest, updated Monday through Friday
cs.MM (Multimedia): 4 papers
2606.11828 2026-06-11 cs.SD cs.AI cs.CR cs.MM New submission

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

Haiyun Li, Shuhai Peng, Zhisheng Zhang, Jingran Xie, Xiaofeng Xie, Hanyang Peng, Zhiyong Wu

Affiliations: Shenzhen International Graduate School, Tsinghua University; Shenzhen Key Laboratory of Intelligent Media and Content Understanding; Tencent AI Lab

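AI summary: Proposes a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, raising watermark energy while preserving imperceptibility and improving robustness to speech reconstruction models.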

Comments
Accepted by ICME2026
Abstract

Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.
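
The robustness-fidelity trade-off described above can be made concrete with a small numeric sketch. It is not the paper's method: it assumes a purely additive watermark and uses signal-to-noise ratio (SNR) as a stand-in for fidelity. The names `power` and `snr_db` and the toy signals are hypothetical.

```python
import math

def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(host, watermark, gain):
    """SNR in dB of the host relative to an additive watermark scaled by `gain`."""
    scaled = [gain * w for w in watermark]
    return 10.0 * math.log10(power(host) / power(scaled))

# Toy signals (assumptions, not data from the paper): a 220 Hz tone as the host
# and a much weaker 3 kHz tone standing in for the watermark, sampled at 16 kHz.
SAMPLE_RATE = 16000
host = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
watermark = [0.01 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE)
             for n in range(SAMPLE_RATE)]

# Raising the watermark gain raises its energy and lowers the SNR.
for gain in (1.0, 2.0, 4.0):
    print(f"gain={gain:.0f}  SNR={snr_db(host, watermark, gain):.1f} dB")
```

Under these assumptions, each doubling of the watermark amplitude lowers the SNR by about 6 dB. This is the cost that a low-energy design avoids and that the feature-aligned approach aims to make less audible.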

2606.11210 2026-06-11 cs.CL cs.AI cs.MM New submission

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

John Kos, Rudra Singh, Ashok Goel

Affiliations: Georgia Institute of Technology

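AI summary: Proposes the T2MM architecture, which uses an LLM to generate interactive models in the ecological modeling software VERA and outperforms a full-code-generation baseline.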

Comments
16 pages, 4 figures
Abstract

Model construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated into educational contexts to support learning. However, these tools lack the visual interactivity that some learning contexts require. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM-supported architecture that assists in model construction within the open-inquiry, ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM on a custom, procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

2601.18934 2026-06-11 cs.HC cs.MM Updated version

Whispering Water: Materializing Human-AI Dialogue as Interactive Ripples

Ruipeng Wang, Tawab Safi, Yunge Wen, Christina Cunningham, Hoi Ling Tang, Behnaz Farahi

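AI summary: Materializes human-AI dialogue on a physical water surface by translating speech sentiment into excitation frequencies, feeding semantic content into a multi-agent LLM system, and decomposing synthesized speech into harmonic components via logarithmic spacing and Bark-scale mapping.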

Abstract

Water has long served as a recipient of human confession across cultures. We present Whispering Water, an interactive installation that materializes human-AI dialogue through cymatic patterns on water. Participants confess to a water surface, triggering a four-phase ritual: confession, contemplation, response, and release. Speech sentiment is translated into excitation frequencies that prime the water's physical state, while semantic content enters a multi-agent system of heterogeneous LLMs whose identities emerge through situated discourse. A novel algorithm decomposes synthesized speech into harmonic components via logarithmic spacing and Bark-scale mapping, reconstructing machine voices as physical wave superpositions. The installation explores emotional self-exploration through sensory-rich, ritually framed human-AI interaction.
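
The two frequency mappings named in the abstract can be illustrated with a short sketch. This is not the installation's algorithm, whose details the abstract does not give. It only shows logarithmically spaced frequencies and where each falls on the Bark scale, using Traunmüller's 1990 approximation without its edge corrections. The function names and the 100 Hz to 8 kHz range are assumptions.

```python
def hz_to_bark(f_hz):
    """Traunmüller (1990) approximation of the Bark scale, without edge corrections."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def log_spaced(f_min, f_max, n):
    """Return n frequencies from f_min to f_max with a constant ratio between neighbors."""
    ratio = f_max / f_min
    return [f_min * ratio ** (k / (n - 1)) for k in range(n)]

def band_table(f_min=100.0, f_max=8000.0, n=8):
    """Pair each log-spaced frequency (Hz) with its approximate Bark value."""
    return [(f, hz_to_bark(f)) for f in log_spaced(f_min, f_max, n)]

for freq, bark in band_table():
    print(f"{freq:8.1f} Hz  ->  {bark:5.2f} Bark")
```

Logarithmic spacing keeps the ratio between neighboring components constant, while the Bark scale places frequencies by perceived pitch distance. How the installation combines the two is not specified in the abstract.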

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS Updated version

Benchmarking Cross-Domain Audio-Visual Deception Detection

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

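AI summary: Presents the first cross-domain audio-visual deception detection benchmark, evaluates generalization across scenarios, and proposes the MM-IDGM algorithm and the Attention-Mixer fusion method to improve performance.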

Comments
17 pages
Abstract

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, such as polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize to real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three domain sampling strategies (domain-simultaneous, domain-alternating, and domain-by-domain) for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
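
The three domain sampling strategies can be sketched as batch schedulers. The strategy names come from the abstract, but the batch construction below is only one plausible reading of those names, not the benchmark's implementation. The toy domains and all function and parameter names are hypothetical.

```python
import random

def domain_simultaneous(domains, per_domain, steps, seed=0):
    """Assumed reading: every batch mixes samples from all source domains."""
    rng = random.Random(seed)
    batches = []
    for _ in range(steps):
        batch = []
        for name, samples in domains.items():
            batch.extend((name, s) for s in rng.sample(samples, per_domain))
        batches.append(batch)
    return batches

def domain_alternating(domains, batch_size, steps, seed=0):
    """Assumed reading: each batch comes from one domain, cycling through domains."""
    rng = random.Random(seed)
    names = list(domains)
    batches = []
    for step in range(steps):
        name = names[step % len(names)]
        batches.append([(name, s) for s in rng.sample(domains[name], batch_size)])
    return batches

def domain_by_domain(domains, batch_size, steps_per_domain, seed=0):
    """Assumed reading: finish all batches of one domain before moving to the next."""
    rng = random.Random(seed)
    batches = []
    for name, samples in domains.items():
        for _ in range(steps_per_domain):
            batches.append([(name, s) for s in rng.sample(samples, batch_size)])
    return batches

# Toy source domains (placeholders for real datasets): integers stand in for samples.
TOY_DOMAINS = {"A": list(range(10)), "B": list(range(10, 20)), "C": list(range(20, 30))}

for batch in domain_alternating(TOY_DOMAINS, batch_size=2, steps=3):
    print(batch)
```

In this reading, the three schedulers see the same data and differ only in the order in which a model encounters each domain, which is the variable a multi-to-single evaluation would compare.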