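arXivDaily is a daily academic digest that syncs the full arXiv dataset and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.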

2411.17690 2026-04-21 cs.MM cs.CV cs.SD eess.AS Version update

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

Comments 30 pages, Decoder-only model, Speech Synthesis

Abstract

Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis, a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. Our findings reveal that both "global sequential indexing" (unique position IDs across modalities) and "co-temporal ordered indexing" (identical IDs for temporally corresponding tokens) achieve strong synchronization performance, with co-temporal ordered indexing providing a simple mechanism without explicit timestamp metadata. Both text and video contribute complementary signals: text ensures intelligibility while video provides temporal cues and emotional expressiveness. Modality ordering reveals a consistent trade-off: video-first ordering achieves stronger in-domain performance while text-first ordering generalizes more robustly to unseen domains. Our findings also reveal that diverse large-scale training enables transferable synchronization strategies. To enable fine-grained analysis, we also introduce TimeSync, a phoneme-level metric that reveals temporal misalignments overlooked by frame-level metrics. These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders.
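
The two indexing schemes named in this abstract can be illustrated with a minimal sketch. This is only one possible reading of the parenthetical definitions, not the authors' implementation: the function names, the text-then-video-then-speech layout, and the example token rates (25 video tokens and 50 speech tokens per second) are all assumptions made for illustration.

```python
def global_sequential_ids(n_text, n_video, n_speech):
    """Give every token a unique position ID, counting across modalities."""
    text = list(range(n_text))
    video = list(range(n_text, n_text + n_video))
    start = n_text + n_video
    speech = list(range(start, start + n_speech))
    return text, video, speech


def co_temporal_ids(n_text, n_video, n_speech, video_rate, speech_rate):
    """Give a speech token the ID of the video frame it overlaps in time.

    Assumes integer token rates (tokens per second) and n_video >= 1.
    """
    text = list(range(n_text))
    video = [n_text + i for i in range(n_video)]
    speech = []
    for j in range(n_speech):
        # Speech token j starts at j / speech_rate seconds, which falls in
        # video frame floor(j * video_rate / speech_rate).
        frame = min((j * video_rate) // speech_rate, n_video - 1)
        speech.append(n_text + frame)
    return text, video, speech


if __name__ == "__main__":
    # Hypothetical sizes: 2 text tokens, 3 video frames, 6 speech tokens.
    print(global_sequential_ids(2, 3, 6))
    print(co_temporal_ids(2, 3, 6, video_rate=25, speech_rate=50))
```

In the second scheme, two speech tokens share each video frame's ID because the assumed speech rate is twice the video rate; no timestamps are passed to the model, only the shared IDs.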

2604.18489 2026-04-21 cs.SD cs.CL eess.AS Version update

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

Hao Meng, Siyuan Zheng, Shuran Zhou, Qiangqiang Wang, Yang Song

Comments Accepted by IEEE ICASSP 2026

2604.18187 2026-04-21 cs.SD cs.CL Version update

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie, Wenfu Wang, Li Liu, Dong Yu

Abstract

Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
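
The hybrid reward described in this abstract combines an LLM evaluator score with an embedding similarity term. The sketch below shows one simple way such a combination could look. The equal weighting, the rescaling of cosine similarity to [0, 1], and the names `hybrid_reward` and `weight` are assumptions for illustration; the abstract does not state how the two components are actually combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_reward(llm_score, generated_emb, reference_emb, weight=0.5):
    """Blend an LLM evaluator score in [0, 1] with embedding similarity.

    weight is the share given to the LLM score (an assumed hyperparameter);
    weight=1.0 reduces to an LLM-only reward.
    """
    similarity = cosine_similarity(generated_emb, reference_emb)
    similarity_01 = (similarity + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
    return weight * llm_score + (1.0 - weight) * similarity_01


if __name__ == "__main__":
    # Hypothetical evaluator score and toy 2-dimensional embeddings.
    print(hybrid_reward(0.8, [1.0, 0.0], [1.0, 0.0]))
```

Under these assumptions, setting `weight=1.0` corresponds to the LLM-only reward that the abstract describes for Stage 2.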

2604.17986 2026-04-21 cs.SD cs.AI Version update

Latent Fourier Transform

Mason Wang, Cheng-Zhi Anna Huang

Comments ICLR 2026 Oral

2604.17958 2026-04-21 eess.AS cs.SD Version update

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

Huakang Chen, Jingbin Hu, Liumeng Xue, Qirui Zhan, Wenhao Li, Guobin Ma, Hanke Xie, Dake Guo, Linhan Ma, Yuepeng Jiang, Bengu Wu, Pengyuan Xie, Chuan Xie, Qiang Zhang, Lei Xie

2604.17852 2026-04-21 cs.SD Version update

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

Ho-Lam Chung, Yiming Chen, Hung-yi Lee

Comments ACL 2026 Findings

2604.16254 2026-04-21 cs.SD eess.AS Version update

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

Heewon Oh

Comments v2: Added SONICS 3-way (n=23,288), OOD taxonomy, benchmark coverage table, baseline reproduction appendix; toned-down claims; reframed discussion as asymmetric defender advantage. 8 pages, 6 figs, 12 tables

2604.14548 2026-04-21 cs.SD cs.LG eess.AS Version update

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu

2601.20867 2026-04-21 cs.SD cs.AI eess.AS Version update

Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim

Comments ACL 2026 findings

2601.10384 2026-04-21 cs.SD Version update

RSA-Bench: Benchmarking Audio Large Models in Real-World Acoustic Scenarios

Yibo Zhang, Liang Lin, Kaiwen Luo, Shilinlu Yan, Jin Wang, Yaoqi Guo, Yitian Chen, Yalan Qin, Zhenhong Zhou, Kun Wang, Li Sun

2601.05543 2026-04-21 cs.CL cs.SD eess.AS Version update

Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, Zhizheng Wu

Comments Accepted by ACL 2026 Main Conference

2510.08878 2026-04-21 cs.SD cs.AI cs.CL eess.AS Version update

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Comments Accepted at ACL 2026 Main

2510.06201 2026-04-21 eess.AS cs.AI cs.CL cs.LG cs.SD Version update

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

Mingxuan Wang, Satoshi Nakamura

Comments 5 pages, 3 figures. Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

2509.14804 2026-04-21 cs.SD eess.AS Version update

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

Mingchen Shao, Bingshen Mu, Chengyou Wang, Hai Li, Ying Yan, Zhonghua Fu, Lei Xie

2504.08644 2026-04-21 eess.AS cs.SD eess.SP Version update

Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi, Philip J. B. Jackson

Journal ref IEEE Signal Processing Letters 2026

2604.17823 2026-04-21 cs.SD cs.AI cs.CL Version update

A novel LSTM music generator based on the fractional time-frequency feature extraction

Li Ya, Chen Wei, Li Xiulai, Yu Lei, Deng Xinyi, Chen Chaofan

Comments This work was supported by Hainan Provincial Natural Science Foundation of China (Grant No. 723QN238)

2604.17435 2026-04-21 cs.CL cs.AI cs.SD eess.AS Version update

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee

Comments Submitted to Interspeech. Audio Demo and Dataset: https://47zzz.github.io/MoVE/

2604.17358 2026-04-21 cs.CL cs.AI cs.SD Version update

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

Dongwook Lee, Eunwoo Song, Che Hyun Lee, Heeseung Kim, Sungroh Yoon

Comments ACL 2026 main conference

2604.17248 2026-04-21 eess.AS cs.CL cs.SD Version update

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang, Hung-yi Lee

Comments Submitted to INTERSPEECH 2026

2604.14654 2026-04-21 cs.SD eess.AS Version update

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

Comments Withdrawn by the authors due to incomplete bitrate accounting in the ILN-based pipeline. The side information introduced by ILN was not fully included in the effective bitrate, making the reported 200 bps results and related comparisons unreliable. The withdrawal does not concern the paper's core RL-based methodological idea. A corrected version may follow

2604.11552 2026-04-21 cs.SD cs.CL Version update

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Tao Feng, Yuxiang Wang, Yuancheng Wang, Xueyao Zhang, Dekun Chen, Chaoren Wang, Xun Guan, Zhizheng Wu

2601.04744 2026-04-21 cs.SD cs.AI Version update

Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

Xingyuan Li, Mengyue Wu

Comments Accepted for publication as a Findings paper at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

2601.03632 2026-04-21 eess.AS cs.AI cs.SD Version update

ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, Xie Chen

Comments ACL 2026

2512.03563 2026-04-21 cs.SD cs.AI Version update

State Space Models for Bioacoustics: A Comparative Evaluation with Transformers

Chengyu Tang, Sanjeev Baskiyar

2510.23969 2026-04-21 cs.SD cs.CL eess.AS Version update

emg2speech: Synthesizing speech from electromyography using self-supervised speech models

Harshavardhana T. Gowda, Daniel C. Comstock, Lee M. Miller

2508.08775 2026-04-21 cs.SD cs.GR cs.NA math.NA Version update

SonicRadiation: A Hybrid Numerical Solution for Sound Radiation without Ghost Cells

Xutong Jin, Fei Zhu, Guoping Wang, Sheng Li

Comments 11 pages

2411.12363 2026-04-21 cs.SD eess.AS Version update

DGSNA: Dynamic Generative Scene-based Noise Addition method

Zihao Chen, Zhentao Lin, Bi Zeng, Linyi Huang, Jia Cai

2401.10747 2026-04-21 cs.SD cs.AI cs.CL cs.LG eess.AS Version update

Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

Weide Liu, Huijing Zhan

2604.17005 2026-04-21 cs.CV cs.SD Version update

TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation

Xinran Liu, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng

2604.16970 2026-04-21 eess.AS cs.SD Version update

A state-space representation of the boundary integral equation for room acoustic modelling

Randall Ali, Thomas Dietzen, Matteo Scerbo, Enzo De Sena, Toon van Waterschoot

Comments 14 pages, 6 figures

2604.16749 2026-04-21 cs.SD cs.CL eess.AS Version update

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

Benjamin Chou, Yi Zhu, Surya Koppisetti

Comments To appear at ACL Findings 2026

2604.16659 2026-04-21 cs.CR cs.SD Version update

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Jaechul Roh, Amir Houmansadr

Abstract

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned -- determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses: filtering training data to maximize distance from harmful embeddings, and a textual system prompt at inference, both reducing JSR to near-zero without architectural modification. Our mechanistic analysis on two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.
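
The first defense mentioned in this abstract, filtering training data to maximize distance from harmful embeddings, can be sketched as follows. This is an illustrative reading only: the use of cosine distance, the nearest-neighbor scoring rule, and the names `filter_far_from_harmful` and `keep` are assumptions, and the toy 2-dimensional vectors stand in for whatever encoder embeddings the authors actually use.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as distance 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def filter_far_from_harmful(benign, harmful, keep):
    """Return indices of the `keep` benign samples farthest from harmful ones.

    Each benign embedding is scored by its distance to the nearest harmful
    embedding; the samples with the largest scores are retained.
    """
    scored = []
    for index, embedding in enumerate(benign):
        nearest = min(cosine_distance(embedding, h) for h in harmful)
        scored.append((nearest, index))
    scored.sort(reverse=True)  # farthest from harmful content first
    return sorted(index for _, index in scored[:keep])


if __name__ == "__main__":
    # Toy embeddings (hypothetical): sample 0 coincides with the harmful one.
    benign = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    harmful = [[1.0, 0.0]]
    print(filter_far_from_harmful(benign, harmful, keep=2))
```

Scoring by the nearest harmful neighbor is a conservative choice: a sample is dropped if it sits close to any single harmful embedding, even when it is far from the rest.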

2604.16658 2026-04-21 cs.SD Version update

Coexisting Tempo Traditions in Beethoven's Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012

Ignasi Sole

2604.16617 2026-04-21 cs.CV cs.MM cs.SD Version update

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas, Rogerio Feris, James R. Glass, Hilde Kuehne

2604.16459 2026-04-21 eess.AS cs.AI cs.CV cs.LG cs.SD eess.SP Version update

Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis

Yu Sha, Shuiping Gou, Bo Liu, Haofan Lu, Ningtao Liu, Jiahui Fu, Horst Stoecker, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou

Comments The paper has been accepted by Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD 2026)

2604.16456 2026-04-21 cs.CL cs.AI cs.LG cs.SD Version update

EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions

Smit Nautambhai Modi, Gandharv Mahajan, Marc Wetter, Randall Welles

2604.16446 2026-04-21 cs.CV cs.LG cs.SD eess.AS Version update

A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions

Junwen Ma, Huhu Xue, Xingyuan Zhao, Weicheng Fu

Comments 2 figs, and 13 tables

2604.16441 2026-04-21 cs.SD cs.AI cs.CL Version update

iPhoneme: Brain-to-Text Communication for ALS Using ConformerXL Decoding

Yoonmin Cha, Dawit Chun, Sung Park

2603.19857 2026-04-21 cs.SD cs.CV Version update

FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He, Yi Yang

Comments Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, 18 pages

2509.18272 2026-04-21 cs.SD cs.MM eess.AS Version update

StereoFoley: Object-Aware Stereo Audio Generation from Video

Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba Heydari, Robert Henzel, Alessandro Toso, Mehrez Souden, Joshua Atkins

Comments Accepted to ICASSP 2026

Journal ref Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

2502.18309 2026-04-21 cs.GR cs.CV cs.SD eess.AS Version update

GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng

Journal ref IEEE Transactions on Multimedia, 2026