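arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.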

2604.24401 2026-04-28 cs.SD cs.AI cs.CL eess.AS Updated version

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li, Ke-Han Lu, Hung-yi Lee

Comments 6 pages, 3 figures, 5 tables

2604.24386 2026-04-28 cs.SD eess.AS Updated version

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

Leekyung Kim, Jonghun Park

Comments accepted to ICASSP 2026

2604.21164 2026-04-28 cs.SD Updated version

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

Jialong Mai, Xiaofen Xing, Xiangmin Xu

Comments Release MAGIC-TTS code, pretrained models, and demo: https://github.com/yongaifadian1/MAGIC-TTS, https://huggingface.co/maimai11/MAGIC-TTS, https://yongaifadian1.github.io/MAGIC-TTS/

2604.02374 2026-04-28 cs.SD Updated version

Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

Ksenia Lysikova, Kirill Borodin, Grach Mkrtchian

Comments Submitted to IEEE Access. Under review

2601.02455 2026-04-28 cs.SD cs.CL eess.AS Updated version

Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models

Xinyu Wang, Ziyu Zhao, Yajie Luo, Yihong Wu, Liheng Ma, Jingrui Tian, Lei Ding, Xiao-Wen Chang, Peng Lu

Comments 9 pages, 4 figures, 3 tables

2509.06027 2026-04-28 cs.SD cs.AI eess.AS Updated version

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Comments Latest arXiv version. Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing. Demos are available at https://yyua8222.github.io/DreamAudio_demopage/

2105.12708 2026-04-28 cs.CL cs.SD eess.AS Updated version

Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

Julia Pritzen, Michael Gref, Dietlind Zühlke, Christoph Schmidt

Comments Submitted to LREC 2022

Journal ref Proceedings of the 13th Language Resources and Evaluation Conference (2022) 3242-3249

2604.23742 2026-04-28 cs.SD Updated version

RTCFake: Speech Deepfake Detection in Real-Time Communication

Jun Xue, Zhuolin Yi, Yihuan Huang, Yanzhen Ren, Yujie Chen, Cunhang Fan, Zicheng Su, Yonghong Zhang, Bo Cai

Comments Accepted by ACL 2026

2604.23717 2026-04-28 cs.SD cs.CL Updated version

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

Peize He, Yaodi Luo, Xiaoqian Liu, Xuyang Liu, Jiahang Deng, Yaosong Du, Bangyu Li, Xiyan Gui, Yuxuan Chen, Linfeng Zhang

Comments Homepage: https://dabdans.github.io/HeadRouter/

2604.23632 2026-04-28 cs.CV cs.MM cs.SD Updated version

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Chunyu Li, Jiaye Li, Ruiqiao Mei, Haoyuan Xia, Hao Zhu, Jingdong Wang, Siyu Zhu

2604.23586 2026-04-28 cs.CV cs.CL cs.MM cs.SD eess.AS Updated version

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang, Peiwen Sun, Yiming Li, Chi-Min Chan, Wei Ye, Shikun Zhang, Wei Xue

2604.23583 2026-04-28 cs.SD cs.HC Updated version

Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments

Charles Patrick Martin

Comments Accepted for publication at the International Conference on New Interfaces for Musical Expression (NIME) 2026

2604.18920 2026-04-28 cs.SD cs.CL Updated version

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

Chenqian Le, Ruisi Li, Beatrice Fumagalli, Yasamin Esmaeili, Xupeng Chen, Amirhossein Khalilian-Gourtani, Tianyu He, Adeen Flinker, Yao Wang

2604.10708 2026-04-28 cs.SD cs.AI cs.CV cs.MM Updated version

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lyu, Wei Xue, Yike Guo

2604.01897 2026-04-28 cs.SD eess.AS Updated version

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie

Comments 5 pages, 2 figures

2510.05799 2026-04-28 cs.CL cs.AI cs.SD Updated version

Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

Rikuto Kotoge, Yuichi Sasaki

Comments Accepted at ACL 2026 (Main)

2510.00626 2026-04-28 cs.SD cs.CL Updated version

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li, Tzu-Han Lin, Hung-yi Lee

Comments Accepted to ICASSP 2026

2509.11717 2026-04-28 cs.SD cs.LG Updated version

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Adhiraj Banerjee, Vipul Arora

Comments Main content: 27 pages; total: 53 pages; 12 figures; preprint, under review

2506.00506 2026-04-28 eess.AS cs.SD Updated version

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

Marie Kunešová, Aleš Pražák, Jan Lehečka

Comments Submitted to ICASSP 2026

2505.15957 2026-04-28 eess.AS cs.AI cs.CL cs.SD Updated version

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang, Neo S. Ho, Hung-yi Lee

Comments EMNLP 2025 (Main). Project Website: https://github.com/ckyang1124/LALM-Evaluation-Survey

2604.23323 2026-04-28 cs.CL cs.SD Updated version

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth

2604.23241 2026-04-28 cs.SD cs.CL Updated version

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

Khalid Zaman, Masashi Unoki

2604.22925 2026-04-28 stat.AP cs.SD Updated version

Come Together: Analyzing Popular Songs Through Statistical Embeddings

Matthew Esmaili Mallory, Mark Glickman, Jason Brown

2604.22817 2026-04-28 eess.AS cs.CL cs.LG cs.SD Updated version

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury, George Saon

Comments Accepted to ICASSP 2026

2601.15889 2026-04-28 eess.AS cs.SD Updated version

A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering

Zhengding Luo, Haozhe Ma, Boxiang Wang, Ziyi Yang, Dongyuan Shi, Woon-Seng Gan

Comments Accepted by 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Journal ref ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2512.16378 2026-04-28 cs.CL cs.AI cs.SD Updated version

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

Comments Project available at https://github.com/sarapapi/hearing2translate | Accepted at TACL, this version is a pre-MIT Press publication version

2510.08618 2026-04-28 eess.AS cs.CV cs.SD Updated version

VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang

Comments Accepted to ACL 2026 Main Conference

2502.12672 2026-04-28 cs.CL cs.AI cs.SD Updated version

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee

Comments Published in IEEE Transactions on Audio, Speech, and Language Processing (TASLP). Model and code available at: https://github.com/nervjack2/Speech-FT

Journal ref in IEEE Transactions on Audio, Speech, and Language Processing, vol. 34, pp. 70-83, 2026

DOI: 10.1109/TASLPRO.2025.3635827

Abstract

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
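The weight-space interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy parameter names are all assumptions made for illustration, and the abstract does not state which interpolation ratio Speech-FT uses. The sketch simply forms a linear blend of a pre-trained and a fine-tuned parameter set, one parameter at a time.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter.

    alpha = 0 recovers the pre-trained weights, alpha = 1 the fine-tuned ones.
    The choice of alpha is an assumption; the abstract does not give a value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged: Weights = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear blend of the two parameter vectors.
        merged[name] = [(1.0 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return merged


# Toy example with made-up values (not taken from any real model).
pretrained = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
finetuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = interpolate_weights(pretrained, finetuned, alpha=0.25)
print(merged)
```

With real checkpoints the same blend would be applied tensor by tensor over two matching parameter dictionaries; plain lists are used here only to keep the sketch free of dependencies.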

2412.06965 2026-04-28 cs.SD eess.AS Updated version

Improving Music Source Separation with Diffusion and Consistency Refinement

Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang, Shlomo Dubnov

2411.03109 2026-04-28 cs.SD cs.MM eess.AS Updated version

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Ziyang Jiang, Jiahe Lei, Xueyan Chen, Yifan Zhang, Zexu Pan, Wei Xue, Xinyuan Qian