arXivDaily arXiv每日学术速递 周一至周五更新
2605.01515 2026-05-05 cs.SD cs.CR 版本更新

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

Yutong Jin, Qi Li, Lingshuang Liu, Jianbing Ni

Comments Accepted by ACISP 2026

详情
英文摘要

In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100\% bit accuracy, even under signal distortions, e.g., compression and additive noise, while preserving high perceptual audio quality.

2605.01219 2026-05-05 cs.MM cs.CV cs.SD eess.IV 版本更新

Multimodal Confidence Modeling in Audio-Visual Quality Assessment

Mayesha Maliha R. Mithila, Mylene C. Q. Farias

Comments Accepted at ICIP 2026, 6 pages, 4 figures, no supplementary material

详情
英文摘要

Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.

2605.01197 2026-05-05 cs.SD cs.MM 版本更新

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

Ke Qiu, Yawen Qin, Tianzhi Jia, Xiaole Yang, Kaimin Wang, Kaixing Yang

详情
英文摘要

Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.

2605.00865 2026-05-05 eess.SP cs.CL cs.CV cs.LG cs.SD q-bio.NC 版本更新

How Well Can We Decode Vowels from Auditory EEG -- A Rigorous Cross-Subject Benchmark with Honest Assessment

Xiaoyang Li

Comments 31 pages, 11 figures; includes supplementary material (14 pages, additional figures and analyses)

详情
英文摘要

EEG based phoneme decoding is promising for brain computer interfaces, but many prior studies rely on within subject evaluation, small cohorts, or weak leakage control. We present a reproducible cross subject benchmark for five class vowel decoding (a, e, i, o, u) from auditory EEG using OpenNeuro ds006104 (16 subjects, 61 channels, 256 Hz). Under strict leave one subject out evaluation with training only normalization and explicit anti leakage checks, we compare 14 pipelines from classical machine learning, deep learning, and Riemannian methods. The best full feature model (XGBoost) reaches 24.5 percent accuracy (chance 20 percent), while differential entropy features with LightGBM reach 25.5 percent in feature specific analysis. After multiple comparison correction, strong pairwise model advantages are limited. Classical methods are competitive with deep models in this low signal regime. Additional analyses (ablation, pairwise vowels, within subject CV, ERP, temporal generalization, and electrode importance) indicate that vowel information is real but weak and mainly carried by early transient auditory responses. We release code and evaluation scripts for full reproducibility.

2604.15037 2026-05-05 cs.AI cs.CL cs.SD 版本更新

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Ke Xu, Yuhao Wang, Yu Wang

Comments Submitted to Interspeech 2026

详情
英文摘要

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

2510.10785 2026-05-05 cs.SD 版本更新

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, Volodymyr Kindratenko

Comments 5 pages, 2 figures. Accepted to ICASSP 2026

Journal ref ICASSP 2026, Barcelona, Spain, 2026, pp. 17052-17056

详情
英文摘要

Previous accent conversion (AC) methods, including foreign accent conversion (FAC), lack explicit control over the degree of modification. Because accent modification can alter the perceived speaker identity, balancing conversion strength and identity preservation is crucial. We present an AC framework that provides an explicit, user-controllable parameter to adjust the strength of pronunciation-level accent modification. Results show performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.

2510.08599 2026-05-05 eess.AS cs.AI cs.CL cs.SD 版本更新

BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Yaya Sy, Christophe Cerisara, Irina Illina

详情
英文摘要

Pruning large pre-trained transformers in a data-scarce scenario is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40 and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90 of the original performance while being 48 smaller and 2.15x faster on a MacBook Air M1.

2509.17661 2026-05-05 eess.AS cs.SD 版本更新

Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring

Jacob J Webber, Oliver Watts, Lovisa Wihlborg, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Cassia Valentini-Botinhao

Comments Accepted to Odyssey 2026

详情
英文摘要

Monitoring the progression of neurodegenerative disease (NDD) has important applications in planning treatment and evaluating new medications. Whereas much work has focused on discriminating patients from healthy controls, or predicting real-world health metrics, we propose a novel measure of disease progression: the severity score, derived from a model trained to minimize what we call the comparator loss. This loss ensures scores obey an ordering relation, based on diagnosis, clinical scores, or simply chronological order of recordings. The proposed comparator loss-based system has the potential to incorporate information from disparate health metrics, critical for making full use of small health-related datasets. We show that a model trained on lightly annotated data is capable of distinguishing between subjects with NDDs and healthy controls. Our score also correlates with annotations not observed in training, such as ALSFRS-R and those of speech and language therapists.

2507.13563 2026-05-05 cs.CL cs.SD eess.AS 版本更新

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Grach Mkrtchian

Comments The work is still in progress. Submitted to Interspeech 2026

详情
英文摘要

We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD for context-preserving segmentation, multi-ASR ensembling with ROVER consensus decoding, while retaining optional word-level timestamps, followed by automatic quality and speaker-purity filtering. The text is further enriched with punctuation restoration, lexical stress and "\textipa{e}/\textipa{He}" normalization, and IPA phonemes. Using Balalaika, we build a 5.1k-hour multi-source Russian corpus with rich annotations, and show consistent gains under equalized training budgets for both speech denoising and TTS; ablations confirm complementary benefits of stress and punctuation and improved synthesis with stricter MOS filtering. The datasets are publicly available at \href{https://huggingface.co/collections/lab260/balalaika-dataset}{\underline{\textbf{HuggingFace}}}