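arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.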

2606.20478 2026-06-19 eess.AS 新提交

Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

超越说话人独立性：跨语言声学到发音反演在芬兰语和俄语上的评估

Ruchi Pandey, Tomi Kinnunen

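AI summary: This study systematically evaluates acoustic-to-articulatory inversion (AAI) under cross-speaker and cross-language domain shift. Using the newly built Finnish-Russian bilingual EMA corpus FROST-EMA, it compares articulatory targets, acoustic front-ends, and inversion back-ends, and finds a moderate drop in Pearson correlation across genders (about 0.05-0.10) and a larger drop across languages (about 0.10-0.20).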

AI中文摘要

声学到发音反演（AAI）在域偏移下仍然具有挑战性，其中说话人属性的变化和跨语言条件常常导致性能下降。我们在这种偏移下进行了系统评估，并在FROST-EMA（一个芬兰语-俄语双语EMA语料库）上建立了基线基准。FROST-EMA解决了现有资源的英语偏见和有限的说话人多样性。我们基准测试了（i）发音目标（原始EMA坐标与声道变量），（ii）声学前端（MFCC与SSL特征），以及（iii）反演后端（BiLSTM与轻量级基于注意力的序列模型）。我们进一步定义了跨性别迁移（语言内）和跨语言迁移（性别内）的评估协议。结果表明，相对于域内基线，跨性别不匹配导致皮尔逊相关系数适度下降（约0.05至0.10），而跨语言不匹配导致更大的下降（约0.10至0.20）。

英文摘要

Acoustic-to-articulatory inversion (AAI) remains challenging under domain shifts where changes in speaker attributes and cross-language conditions often degrade performance. We conduct a systematic evaluation under such shifts and establish baseline benchmarks on FROST-EMA, a Finnish-Russian bilingual EMA corpus. FROST-EMA addresses the English bias and limited speaker diversity of existing resources. We benchmark (i) articulatory targets (raw EMA coordinates vs tract variables), (ii) acoustic front-ends (MFCC vs SSL features), and (iii) inversion back-ends (BiLSTM vs a lightweight attention-based sequence model). We further define evaluation protocols for cross-gender transfer (within language) and cross-language transfer (within gender). The results indicate that cross-gender mismatch introduces moderate Pearson correlation declines (approximately 0.05 to 0.10) relative to the in-domain baseline, whereas cross-language mismatch causes larger drops (approximately 0.10 to 0.20).
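
The abstract reports results as Pearson correlation between predicted and measured articulatory trajectories. As a minimal sketch of that metric, assuming each trajectory is stored as a list of channels and each channel as a list of per-frame values (the function names and data layout are illustrative assumptions, not the paper's evaluation code):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_channel_correlation(measured, predicted):
    """Average the per-channel Pearson correlation over all articulatory channels.

    Assumed layout: measured[c][t] is channel c at frame t (hypothetical).
    """
    scores = [pearson(m, p) for m, p in zip(measured, predicted)]
    return sum(scores) / len(scores)
```

Under this reading, a cross-gender or cross-language "drop" would be the in-domain value of `mean_channel_correlation` minus the value obtained on the mismatched test condition.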

2606.20450 2026-06-19 eess.SP 新提交

Max-Min Rate Fairness Optimization for Multi-User Pinching-Antenna NOMA Systems

多用户捏合天线NOMA系统的最大最小速率公平性优化

Mahmoud AlaaEldin, Amy Inwood, Xidong Mu, Michail Matthaiou

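AI summary: For multi-waveguide pinching-antenna NOMA downlink systems, proposes a two-stage optimization framework that jointly optimizes antenna positions and precoding to maximize the minimum user rate, significantly improving performance.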

AI中文摘要

捏合天线系统（PAS）通过沿米级波导重新定位介电辐射元件（称为捏合天线，PA）来克服信号阻塞，从而创建视距链路。由于每个波导由单个射频（RF）链驱动，非正交多址（NOMA）非常适合基于PAS的多用户通信。本文研究了一个多波导的PAS使能多用户下行NOMA系统，每个波导配备多个PA。联合优化PA位置和基站发射预编码，以最大化最小用户速率。由于PA间干扰引起的快速振荡相干和，所得问题高度非光滑且非凸。为应对这一挑战，我们提出了一种两阶段结构化优化框架。在第一阶段，使用内点算法进行粗略的PA位置和功率分配优化，同时忽略PA信道相位，从而得到接近真实最优的解。在第二阶段，考虑PA信道相位偏移，对PA位置和发射预编码进行微调。该阶段首先应用相位归零，即局部重新定位每个PA，使相应信道相位归零并促进建设性相干合并。然后使用交替过程，迭代执行前后向PA位置精炼和基于逐次凸近似的复发射预编码优化直至收敛，从而减少残余相位失配。仿真结果表明，所提框架显著优于启发式优化基准，且计算时间更短。结果还展示了相对于可比的多输入多输出下行NOMA系统的巨大增益，并揭示了PA数量、用户数量和发射功率对系统性能的影响。

英文摘要

Pinching-antenna systems (PASs) can overcome signal blockage by repositioning dielectric radiating elements, called pinching antennas (PAs), along meter-scale waveguides to create line-of-sight links. Since each waveguide is driven by a single radio-frequency (RF) chain, non-orthogonal multiple access (NOMA) is well suited for PAS-based multi-user communications. This paper studies a PAS-enabled multi-user downlink NOMA system with multiple waveguides, each equipped with multiple PAs. The PA positions and base-station transmit precoding are jointly optimized to maximize the minimum user rate. The resulting problem is highly non-smooth and non-convex because of the rapidly oscillating coherent sums caused by inter-PA interference. To tackle this challenge, we propose a two-stage structured optimization framework. In the first stage, coarse PA-position and power-allocation optimization is performed using an interior-point algorithm while neglecting the PA channel phases, which gives solutions near the true optima. In the second stage, PA positions and transmit precoding are fine-tuned while accounting for the PA channel phase shifts. This stage first applies phase zeroing, where each PA is locally repositioned to align the corresponding channel phase toward zero and promote constructive coherent combining. It then uses an alternating procedure that iteratively performs forward-backward PA position refinement and successive-convex-approximation-based complex transmit precoding optimization until convergence, thereby reducing residual phase mismatch. Simulation results show that the proposed framework significantly outperforms heuristic optimization benchmarks with much lower computational time. They also demonstrate large gains over a comparable multiple-input multiple-output downlink NOMA system and reveal the impact of the number of PAs, users, and transmit power on system performance.

2606.20338 2026-06-19 eess.AS 新提交

Stuttering Classification and Segmentation with Attention-Based Multiple Instance Learning

基于注意力多实例学习的口吃分类与分割

Petar Sušac, Sebastian P. Bayerl, Hrvoje Džapo

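AI summary: Proposes multiple-instance neural networks built on fine-tuned wav2vec 2.0, WavLM, and Whisper encoders that use clip-level data to perform frame-level stuttering classification and segmentation, improving frame-level F1 by 23%.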

Comments Accepted at Interspeech 2026

2606.20266 2026-06-19 eess.AS 新提交

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

基于语音特征调节的无转录流匹配文本转语音

SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Chang D. Yoo

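AI summary: Proposes RTFree-F5, which replaces the reference transcript with self-supervised speech representations mapped into the F5-TTS text-conditioning space through a lightweight adapter, removing the dependence on an external ASR system and reducing WER on dysarthric speech from 24.6% to 10.4%.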

Comments Accepted to Interspeech 2026

AI中文摘要

最近的流匹配文本转语音（TTS）模型，如F5-TTS，在推理时依赖于从外部ASR系统获得的参考转录本。这种依赖性使得零样本TTS对于口音或构音障碍的说话者变得脆弱，而这正是最需要它的场景。此外，我们发现即使有真实转录本可用，基于文本的参考条件化也可能将非典型语音中的非典型声学模式传播到合成语音中。为了解决这个问题，我们提出了RTFree-F5，它用连续的自监督语音表示替换参考转录本，通过轻量适配器映射到F5-TTS的文本条件空间，同时重用预训练检查点。在构音障碍语音上，RTFree-F5将WER从24.6%降低到10.4%，甚至超过了真实参考转录本基线，同时提高了自然度，并在标准基准测试中保持竞争力，而无需任何参考转录本。

英文摘要

Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most needed. Moreover, we find that text-based reference conditioning can propagate atypical acoustic patterns from atypical speech into synthesis, even when ground-truth transcripts are available. To address this, we propose RTFree-F5, which replaces the reference transcript with continuous self-supervised speech representations mapped into F5-TTS's text-conditioning space via a lightweight adapter, while reusing the pretrained checkpoint. On dysarthric speech, RTFree-F5 reduces WER from 24.6% to 10.4%, surpassing even the ground-truth reference transcript baselines, while improving naturalness and remaining competitive on standard benchmarks without requiring any reference transcript.

2606.20222 2026-06-19 eess.SP 新提交

Reliable ORIS-assisted FSO Communications via HARQ

基于HARQ的可靠ORIS辅助自由空间光通信

Georgios D. Chondrogiannis, Athanasios P. Chrysologou, Vasilis K. Papanikolaou, Alexandros-Apostolos A. Boulogeorgos, Nestor D. Chatzidiamantis, Robert Schober

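AI summary: Studies a free-space optical communication link that combines an optical reconfigurable intelligent surface (ORIS) with hybrid automatic repeat request (HARQ); derives a statistical model of the end-to-end channel, gives a closed-form outage probability for HARQ-CC and an outage upper bound for HARQ-IR, and analyzes diversity order and delay behavior.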

Comments 13 pages, 8 Figures, Journal

AI中文摘要

本文研究了一种由光学可重构智能表面（ORIS）辅助并通过混合自动重传请求（HARQ）方案增强的自由空间光（FSO）链路。ORIS在障碍物周围创建虚拟视距路径，而HARQ通过重传和合并恢复受湍流、指向抖动和几何损耗损坏的帧。我们首先通过联合考虑大气湍流、ORIS引起的指向误差和几何衰减，推导了端到端发射器-ORIS-接收器（Tx-ORIS-Rx）反射信道的易处理统计模型。基于这些结果，我们获得了采用Chase合并的HARQ（HARQ-CC）的闭式中断概率（OP）表达式，以及采用增量冗余的HARQ（HARQ-IR）的解析中断上界，这些表达式对任意最大传输轮次有效。我们进一步进行了高信噪比（SNR）分析，该分析提供了中断行为的全面表征，并揭示了两种方案的分集阶数。此外，我们通过平均传输轮次和给定成功解码的条件平均轮次来表征截断HARQ过程的延迟行为。最后，数值和蒙特卡洛结果验证了所提出的分析，并表明HARQ显著提高了ORIS辅助FSO的可靠性，即使对于少量重传轮次，HARQ-IR也能实现比HARQ-CC更低的中断和延迟。

英文摘要

This paper studies a free-space optical (FSO) link assisted by an optical reconfigurable intelligent surface (ORIS) and enhanced by a hybrid automatic repeat request (HARQ) scheme. The ORIS creates a virtual line-of-sight path around obstacles, while HARQ recovers frames corrupted by turbulence, pointing jitter, and geometric loss through retransmission and combining. We first derive a tractable statistical model for the end-to-end transmitter-ORIS-receiver (Tx-ORIS-Rx) reflected channel by jointly accounting for atmospheric turbulence, ORIS-induced pointing errors, and geometric attenuation. Building on these results, we obtain closed-form outage probability (OP) expressions for HARQ with Chase combining (HARQ-CC) and analytical outage upper bounds for HARQ with incremental redundancy (HARQ-IR), valid for an arbitrary maximum number of transmission rounds. We further conduct a high signal-to-noise ratio (SNR) analysis that provides a thorough characterization of the outage behavior and reveals the diversity order of both schemes. In addition, we characterize the delay behavior of the truncated HARQ process through the mean number of transmission rounds and the conditional mean number of rounds given successful decoding. Finally, numerical and Monte Carlo results validate the proposed analysis and show that HARQ substantially improves ORIS-assisted FSO reliability, with HARQ-IR achieving lower outage and delay than HARQ-CC, even for a small number of retransmission rounds.
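
To make the HARQ-CC versus HARQ-IR comparison concrete, here is a rough Monte Carlo sketch using the textbook outage conditions: with Chase combining the per-round SNRs add up, while with incremental redundancy the per-round mutual informations add up. The exponential SNR distribution is only a placeholder assumption; it is not the turbulence, pointing-error, and geometric-loss model derived in the paper, and `harq_outage` is a hypothetical helper name.

```python
import math
import random


def harq_outage(rate, mean_snr, max_rounds, trials=20000, seed=0):
    """Estimate outage probability after max_rounds for HARQ-CC and HARQ-IR.

    rate is the target rate in bits per channel use. Per-round SNRs are
    drawn i.i.d. from an exponential distribution (placeholder fading model).
    """
    rng = random.Random(seed)
    snr_threshold = 2 ** rate - 1
    cc_failures = 0
    ir_failures = 0
    for _ in range(trials):
        snrs = [rng.expovariate(1.0 / mean_snr) for _ in range(max_rounds)]
        # Chase combining: outage if the accumulated SNR stays below threshold.
        if sum(snrs) < snr_threshold:
            cc_failures += 1
        # Incremental redundancy: outage if accumulated information stays below rate.
        if sum(math.log2(1 + g) for g in snrs) < rate:
            ir_failures += 1
    return cc_failures / trials, ir_failures / trials
```

Because the product of the terms (1 + SNR) is never smaller than 1 plus the sum of the SNRs, every trial that fails under IR also fails under CC, so the IR estimate cannot exceed the CC estimate in this sketch. That is consistent with the ordering reported in the abstract.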

2606.20011 2026-06-19 eess.SP 新提交

Amplitude-Phase-Frequency Block Modulation for OFDM-ISAC with SI-Free PAPR Reduction and Pilotless Sensing

用于OFDM-ISAC的幅度-相位-频率块调制：无边信息PAPR降低和无导频感知

Bensheng Yang, Min Fan, Haitao Zhao, Haiming Wang

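AI summary: Proposes an amplitude-phase-frequency block modulation scheme that uses Stokes-sphere mapping and grouped phase optimization to integrate communication and sensing in OFDM without resource partitioning, while reducing PAPR and eliminating the need for pilots.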

AI中文摘要

基于正交频分复用（OFDM）的集成感知与通信系统需要一种统一波形，同时支持可靠数据传输、低峰均功率比（PAPR）和精确信道感知。现有方法在分离的时间或频率资源上复用通信与感知，或依赖专用导频进行信道估计，限制了系统灵活性并增加了开销。本文提出一种用于OFDM的幅度-相位-频率块调制（APFBM）方案，在不进行资源分割的情况下实现通信与感知的波形级集成。信息符号在斯托克斯球上表示，并通过明确规则映射到能量归一化的琼斯矢量，该规则为每个块建立确定性相位参考。这种映射暴露了信号结构中固有的共相自由度。在发射端，分组相位优化算法利用该结构自由度降低PAPR，无需边信息（SI）。在接收端，相同的确定性相位结构支持基于维特比的最大似然（ML）序列检测算法，该算法联合恢复优化相位并估计块状信道幅度和相位。无需专用感知导频，因为感知观测量直接从通信波形中提取。推导了闭式错误率和感知精度表达式。在软件无线电链路上的数值仿真和空中测量证实了有效的PAPR降低、精确的信道感知、可靠的相位恢复和稳定的信道状态信息重建。所提方案以适度降低频谱效率为代价，实现了统一波形设计，同时提供无SI的PAPR降低和无导频感知。

英文摘要

Orthogonal Frequency Division Multiplexing (OFDM)-based integrated sensing and communication systems demand a unified waveform that simultaneously supports reliable data transmission, low peak-to-average power ratio (PAPR), and accurate channel sensing. Existing approaches multiplex communication and sensing across separate time or frequency resources, or rely on dedicated pilots for channel estimation, limiting system flexibility and increasing overhead. This paper proposes an amplitude-phase-frequency block modulation (APFBM) scheme for OFDM that achieves waveform-level integration of communication and sensing without resource partitioning. Information symbols are represented on the Stokes sphere and mapped to energy-normalized Jones vectors through an unambiguous rule that establishes a deterministic phase reference per block. This mapping exposes a common-phase degree of freedom inherent in the signal structure. At the transmitter, a grouped phase optimization algorithm exploits this structural freedom to reduce the PAPR without side information (SI). At the receiver, the same deterministic phase structure enables a Viterbi-based maximum-likelihood (ML) sequence detection algorithm that jointly recovers the optimization phases and estimates the block-wise channel amplitude and phase. No dedicated sensing pilots are required, as the sensing observables are extracted directly from the communication waveform. Closed-form error-rate and sensing-accuracy expressions are derived. Numerical simulations and over-the-air measurements on a software-defined radio link confirm effective PAPR reduction, accurate channel sensing, reliable phase recovery, and stable channel state information reconstruction. The proposed scheme trades a moderate reduction in spectral efficiency for a unified waveform design that simultaneously delivers SI-free PAPR reduction and pilotless sensing.

2606.20001 2026-06-19 eess.AS 新提交

Time-Unconditional Generative Speech Enhancement via Autonomous Rectified Flow

基于自主整流流的时间无条件生成式语音增强

Wen Zhang, Wenbin Jiang, Yang Zhang, Xiaofei Zhou

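AI summary: Proposes an autonomous rectified flow framework that proves the target vector field is time-invariant along linear interpolation paths and designs a time-unconditional network that infers the denoising direction from spatial relations alone, significantly improving generation quality, robustness, and inference efficiency.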

2606.19974 2026-06-19 eess.AS 新提交

Interpreting Content and Speaker Characteristics in Factorised Self-Supervised Subspaces

解释因子化自监督子空间中的内容和说话人特征

Kyle Janse van Rensburg, Herman Kamper

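AI summary: Decomposes WavLM features via SVD into a content matrix and speaker transformations, and finds that the content space mainly encodes intensity, formants, and voicing, whereas the speaker space is strongly associated with pitch and gender and can be used for fine-grained control in speech synthesis.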

Comments 7 pages, 4 figures

AI中文摘要

自监督语音特征同时编码内容和说话人信息。最近的工作引入了一种基于SVD的因子化方法，将这些特征分解为一个共享的内容矩阵（捕获时间变化）和说话人特定的变换（捕获静态说话人特征）。然而，这些组件内部的信息组织方式仍不清楚。在本文中，我们研究了WavLM因子化的内容和说话人子空间的维度如何与语音特征（如音高、强度和发声）相关。我们发现，内容空间中的前几个维度主要捕获强度、高阶共振峰和发声，而音高编码在较后的维度中。相比之下，方差最大的说话人维度与音高和性别强相关，后面的维度捕获高频变化。干预实验表明，操纵这些维度能够实现对语音合成中语音特征的目标控制。此外，联合修改内容和说话人表示可提供对音高和强度等特征的精细控制。

英文摘要

Self-supervised speech features encode both content and speaker information. Recent work introduced an SVD-based factorisation that decomposes these features into a shared content matrix capturing temporal variation and speaker-specific transformations capturing static speaker characteristics. However, how information is organised within these components remains unclear. In this paper, we investigate how the dimensions of WavLM-factorised content and speaker subspaces correlate with speech characteristics such as pitch, intensity, and voicing. We find that leading dimensions in the content space primarily capture intensity, higher-order formants, and voicing, while pitch is encoded in a later dimension. In contrast, the highest-variance speaker dimension is strongly associated with pitch and gender, with later dimensions capturing high-frequency variation. Intervention experiments show that manipulating these dimensions enables targeted control of speech characteristics for speech synthesis. Furthermore, modifying the content and speaker representations jointly provides fine-grained control over characteristics such as pitch and intensity.

2606.19953 2026-06-19 eess.SP 新提交

ConsisFormer: Compute-Efficient Transformer for Wireless Foundation Models Based on Channel Consistency

ConsisFormer: 基于信道一致性的无线基础模型高效计算Transformer

Yuwei Wang, Li Sun, Tingting Yang, Liwen Jing, Yuxuan Shi, Maged Elkashlan, Mérouane Debbah

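AI summary: Proposes ConsisFormer, which exploits the short-term consistency of wireless channels and reduces Transformer computational complexity through adaptive token aggregation and feature sequence interpolation, cutting computation by more than 83% across a range of tasks with negligible performance loss.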

AI中文摘要

无线基础模型（WFM）最近成为AI原生6G网络的一种有前景的范式，能够实现适应各种通信和感知任务的通用信道表示。现有的WFM主要基于Transformer架构，该架构提供了优越的性能，但计算复杂度与输入序列长度的平方成正比，这对其在严格推理延迟约束下的部署构成了重大障碍。为了解决这个问题，本文提出ConsisFormer，一种基于无线信道短时一致性的高效计算Transformer设计，作为WFM的骨干网络。利用相邻时间或频率实例共享相似的散射体簇并因此表现出相似信道特性的观察，我们开发了自适应令牌聚合（ATA）模块，动态合并相邻信道状态信息（CSI）令牌，从而减少自注意力计算中涉及的令牌序列长度以降低计算成本。此外，我们提出了一种特征序列插值（FSI）方法，基于Transformer块输出的稀疏特征序列恢复完整的CSI表示，从而在保持性能不受影响的同时确保低复杂度。此外，我们提出了一种用于WFM的聚合自编码器（AAE）预训练范式，通过压缩和恢复从稀疏化CSI令牌中学习鲁棒的信道表示。仿真结果表明，所提出的设计将WFM的计算复杂度降低了83%以上，同时在包括信道预测、视距/非视距分类、波束预测和定位在内的各种任务上性能损失极小。

英文摘要

Wireless foundation models (WFMs) have recently emerged as a promising paradigm for AI-native 6G networks, enabling universal channel representations adaptable to diverse communication and sensing tasks. Existing WFMs are predominantly built upon the Transformer architecture, which delivers superior performance but incurs computational complexity proportional to the square of the input sequence length, posing a significant barrier to their deployment under stringent inference latency constraints. To address this issue, in this paper, we propose ConsisFormer, a compute-efficient Transformer design based on short-term consistency of wireless channels, as a WFM backbone. By utilizing the observation that adjacent time or frequency instances share similar clusters of scatterers and thus exhibit similar channel characteristics, we develop an adaptive token aggregation (ATA) module to dynamically merge neighboring channel state information (CSI) tokens, thereby reducing the length of the token sequence involved in self-attention calculations to lower the computational cost. Furthermore, we propose a feature sequence interpolation (FSI) method to recover the full CSI representation based on the sparse feature sequence output by the Transformer blocks, thus keeping the performance unaffected while ensuring low complexity. Moreover, we propose an aggregated auto-encoder (AAE) pre-training paradigm for WFMs, enabling robust channel representation learning from sparsified CSI tokens via compression and recovery. Simulation results show that the proposed design reduces the computational complexity of WFMs by over 83% with negligible performance loss on various tasks including channel prediction, LoS/NLoS classification, beam prediction, and localization.
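
The idea of merging neighboring, similar tokens before self-attention can be illustrated with a small sketch. This is not the paper's ATA module: the cosine-similarity rule, the `threshold` value, and the averaging step are all assumptions chosen only to show how a shorter token sequence arises from channel consistency.

```python
import math


def cosine(u, v):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_adjacent_tokens(tokens, threshold=0.95):
    """Greedily group consecutive tokens that resemble their predecessor.

    Each group is replaced by the element-wise mean of its members
    (hypothetical merging rule, for illustration only).
    """
    groups = []
    for token in tokens:
        if groups and cosine(groups[-1][-1], token) >= threshold:
            groups[-1].append(token)
        else:
            groups.append([token])
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]
```

Since self-attention cost grows with the square of the sequence length, shrinking five tokens to three in this toy setting would cut the attention cost to (3/5)^2 = 36% of the original.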

2606.19940 2026-06-19 eess.AS 新提交

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

分析60种印度语言语音表征中的语言和地理变异

Pavan Kumar J, Agneedh Basu, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

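AI summary: Fine-tunes Whisper-base and Wav2Vec2.0 with joint language-region supervision and finds that this approach improves regional separability in the embedding space while preserving language classification ability; the embedding structure is analyzed using normalized conditional mutual information.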

2606.19724 2026-06-19 eess.SP 新提交

Cyclic-Prefix OFDM Probing for Spatial-ISI-Free Distributed Acoustic Sensing via Frequency-Domain Channel Reconstruction

基于频域信道重构的循环前缀OFDM探测实现无空间ISI分布式声学传感

Huan Huang, Zhiyang Xue, Ziang Chen, Zhongxing Tian, Dongdong Zou, Gangxiang Shen, Yi Cai

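AI summary: Proposes using a cyclic-prefix orthogonal frequency-division multiplexing (CP-OFDM) waveform as the sensing probe and eliminating the spatial inter-symbol interference (ISI) of matched-filter pulse compression through frequency-domain channel reconstruction, achieving spatial-ISI-free distributed acoustic sensing while also recovering communication data and demonstrating shared-waveform integrated sensing and communication (ISAC).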

Comments This manuscript has been submitted for possible publication

AI中文摘要

基于匹配滤波的脉冲压缩分布式声学传感（DAS）存在非零压缩旁瓣，导致确定性距离单元间泄漏，即空间符号间干扰（ISI），并在重建的瑞利背向散射迹中产生虚假响应。我们提出一种用于$\phi$-OTDR的循环前缀正交频分复用（CP-OFDM）DAS系统，使用承载数据的CP-OFDM波形作为传感探头。该系统还恢复前向通信数据，初步展示了共享波形集成感知与通信（ISAC）。据我们所知，这是首次将分布式瑞利背向散射建模为有限记忆传感多径信道。基于该模型，我们证明，如果有用OFDM和CP长度覆盖传感多径记忆，则去除CP、单抽头频域均衡和逆离散傅里叶变换可重建每个距离单元系数，且无确定性波形引起的空间ISI，从而实现无空间ISI的相位解调。在模拟的5.2公里链路上，组内间隔5.31–5.83米的十个同时强、弱事件，所提接收机抑制了事件外泄漏，并将相位迹均方误差相比匹配滤波脉冲压缩提升高达29.55 dB。在5.2公里光纤链路的相干外差实验中，占用带宽111.984 MHz，在5 V和1 V驱动下，500 Hz PZT振动分别被盲定位在5.071公里和5.066公里处，其波形恢复的相关系数分别为0.990和0.962。同一承载数据探头还恢复了一幅图像，误码率为零，误差矢量幅度中位数为-23.14 dB。这些结果验证了CP-OFDM辅助的频域信道重构用于无空间ISI的DAS，并展示了其在共享波形光纤ISAC中的潜力。

英文摘要

Matched-filter-based pulse-compression distributed acoustic sensing (DAS) suffers from nonzero compression sidelobes that cause deterministic inter-range-bin leakage, i.e., spatial inter-symbol interference (ISI), and false responses in reconstructed Rayleigh-backscatter traces. We propose a cyclic-prefix orthogonal frequency-division multiplexing (CP-OFDM) DAS system for $\phi$-OTDR, using a data-bearing CP-OFDM waveform as the sensing probe. It also recovers forward communication data, providing an initial demonstration of shared-waveform integrated sensing and communication (ISAC). To our knowledge, this is the first formulation of distributed Rayleigh backscattering as a finite-memory sensing multipath channel. Based on this formulation, we prove that, if the useful OFDM and CP lengths cover the sensing multipath memory, CP removal, one-tap frequency-domain equalization, and inverse discrete Fourier transform reconstruct each range-bin coefficient without deterministic waveform-induced spatial ISI, enabling spatial-ISI-free phase demodulation. For a simulated 5.2-km link with ten simultaneous strong and weak events spaced by 5.31–5.83 m within groups, the proposed receiver suppresses off-event leakage and improves phase-trace mean-square error by up to 29.55 dB over matched-filter pulse compression. In a heterodyne coherent experiment over a 5.2-km fiber link with 111.984-MHz occupied bandwidth, 500-Hz PZT vibrations are blindly localized at 5.071 and 5.066 km under 5- and 1-V drives, respectively, and their waveforms are recovered with correlation coefficients of 0.990 and 0.962. The same data-bearing probe also recovers an image with zero measured bit-error rate and a median error vector magnitude of -23.14 dB. These results validate CP-OFDM-aided frequency-domain channel reconstruction for spatial-ISI-free DAS and demonstrate its potential for shared-waveform optical-fiber ISAC.
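
The receiver chain named in the abstract (CP removal, one-tap frequency-domain equalization, inverse DFT) can be sketched on a toy discrete-time channel. This is a simplified baseband illustration, not the paper's $\phi$-OTDR system: the channel taps stand in for range-bin coefficients, noise is omitted, and `dft` and `reconstruct_channel` are hypothetical helper names.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def reconstruct_channel(symbols, channel, cp_len):
    """Send one CP-OFDM symbol through a multipath channel and recover its taps.

    Requires cp_len >= len(channel) - 1 so the cyclic prefix covers the memory.
    """
    n = len(symbols)
    time_signal = dft(symbols, inverse=True)
    tx = time_signal[-cp_len:] + time_signal          # prepend cyclic prefix
    rx = [
        sum(channel[k] * tx[i - k] for k in range(len(channel)) if i - k >= 0)
        for i in range(len(tx))
    ]                                                 # linear convolution
    received = dft(rx[cp_len:cp_len + n])             # CP removal, then DFT
    freq_response = [y / x for y, x in zip(received, symbols)]  # one-tap division
    return dft(freq_response, inverse=True)           # back to tap (range-bin) domain
```

Because the prefix turns the linear convolution into a circular one, the recovered taps match the true channel exactly in this noiseless setting and all remaining bins are zero, which is the sidelobe-free behavior the abstract contrasts with matched-filter pulse compression.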

2606.19720 2026-06-19 eess.SP 新提交

An Optimization Framework for Certain Separable Problems using Neural Networks

基于神经网络的特定可分离问题优化框架

Rohit Negi, Soummya Kar

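AI summary: For parameter-separable constrained optimization problems, proposes a two-stage strategy of offline learning and online processing that uses ADMM and neural networks to reduce online computational complexity.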

Comments 15 pages, 5 figures

2606.19666 2026-06-19 eess.SP 新提交

Degrees of Freedom and Beamforming for Large Intelligent Surfaces

大规模智能表面的自由度与波束赋形

Jiawang Li, Alireza Saberkari, Buon Kiong Lau, Mats Gustafsson

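AI summary: Estimates the spatial degrees of freedom (DoF) of large intelligent surfaces (LISs) using closed-form expressions based on the mutual shadow area and verifies agreement with numerical singular-value spectra; designs sampling schemes and beamforming based on the DoF analysis, showing that roughly as many independent beams as DoF can be formed and that exceeding this limit increases interference; a polarization study shows that the electric-field components contribute unequally to the DoF and that the total-field DoF is twice that of a single polarization component.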

AI中文摘要

空间自由度（DoF）、采样和波束赋形是多用户大规模智能表面（LIS）的基础，其中电磁场必须在多个近场位置进行成形、分辨和聚焦。本文利用互阴影面积的闭式表达式，针对代表性LIS配置估计了DoF数量。通过数值奇异值谱验证了所得DoF预测，其谱膝点与理论估计紧密吻合。对于线源配置，通过将源或观测线划分为单位DoF区间，开发了一种解析采样方案，从而能够选择空间样本。使用最大比传输和迫零的波束赋形结果表明，可以形成大约DoF数量的独立波束。试图超过此限制会导致干扰增加和性能下降。对于基于表面的LIS配置，采样点则通过离散经验插值方法数值确定。相应的波束赋形结果进一步证实，目标区域可以支持大约与DoF分析预测数量相同的独立波束。最后，一项极化感知研究表明，电场分量对DoF的贡献不相等，且总场DoF是单极化分量DoF的两倍。

英文摘要

Spatial degrees of freedom (DoF), sampling, and beamforming are fundamental to multi-user large intelligent surfaces (LISs), where electromagnetic fields must be shaped, resolved, and focused at multiple near-field locations. This work estimates the number of DoF using closed-form expressions derived from the mutual shadow area for representative LIS configurations. The resulting DoF predictions are validated through numerical singular-value spectra, whose spectral knee points closely match the theoretical estimates. For line-source configurations, an analytic sampling scheme is developed by partitioning the source or observation line into unit-DoF intervals, enabling the selection of spatial samples. Beamforming results using maximum-ratio transmission and zero-forcing demonstrate that the number of independent beams that can be formed is approximately equal to the number of DoF. Attempting to exceed this limit results in increased interference and degraded performance. For surface-based LIS configurations, sampling points are instead determined numerically using the discrete empirical interpolation method. The corresponding beamforming results further confirm that the target region can support approximately as many independent beams as predicted by the DoF analysis. Finally, a polarization-aware study reveals that the electric-field components contribute unequally to the DoF and that the total-field DoF is twice that of a single polarization component.

2606.19536 2026-06-19 eess.SP 新提交

Multistatic J-Band Radar TX/RX Chipset in SiGe BiCMOS with Integrated x16 Frequency Multiplier Chain and High EIRP

采用SiGe BiCMOS工艺的集成x16倍频链和高EIRP的多基地J波段雷达收发芯片组

Stephan Hauptmeier, Kennet Braasch, Till Ziegler-Bellenberg, Diana P. Cortes N., Tobias T. Braun, Michael Höft, Nils Pohl

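AI summary: Designs and measures a multistatic J-band radar chipset comprising transmitter and receiver MMICs with an integrated x16 frequency multiplier chain, achieving high EIRP and long-range detection.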

AI中文摘要

本文介绍了一种多基地J波段雷达芯片组的设计与测量，该芯片组包括一个发射机和一个接收机MMIC，两者均集成了$\times$16倍频链，用于低频本振分配和可扩展雷达配置。多基地雷达架构可以同时维持高发射功率和高接收灵敏度，这一优势在本芯片组中得到了充分利用。为此，发射机MMIC上集成的四路功率合成放大器链提供了11.2 dBm的输出功率。在292 GHz下，使用准直PTFE透镜时测得的EIRP为41 dBm，无透镜时为8.8 dBm。尽管倍频因子较高，但片上谐波抑制优于24 dBc，而通过多个滤波器级实现了约50 dBc的辐射带内谐波抑制。接收机MMIC包含三级低噪声放大器，在292 GHz下整体转换增益为43.3 dB。集成的片上贴片天线便于系统集成，并可使用高方向性介质透镜，使该芯片组适用于长达150米的远距离雷达测量。MMIC采用130 nm SiGe BiCMOS工艺实现，其$f_T$和$f_{\max}$分别为500 GHz和610 GHz。

英文摘要

This work presents the design and measurement of a multistatic J-band radar chipset comprising a transmitter and a receiver MMIC, both featuring an integrated $\times$16 frequency multiplier chain for low-frequency local-oscillator distribution and scalable radar configurations. Multistatic radar architectures can sustain high transmission power and high receiver sensitivity simultaneously, an advantage that is fully leveraged in the present chipset. To this end, a four-way power-combining amplifier chain integrated on the transmitter MMIC delivers an output power of 11.2 dBm. The resulting measured EIRP is 41 dBm at 292 GHz with a collimating PTFE lens and 8.8 dBm without a lens. Despite the high frequency-multiplication factor, an on-chip harmonic rejection better than 24 dBc was measured, while a radiated in-band harmonic rejection of approximately 50 dBc was achieved through multiple filter stages. The receiver MMIC incorporates a three-stage low-noise amplifier and exhibits an overall conversion gain of 43.3 dB at 292 GHz. Integrated on-chip patch antennas facilitate system integration and the use of highly directive dielectric lenses, making the chipset suitable for long-range radar measurements, which are demonstrated up to 150 m. The MMICs are realized in a 130 nm SiGe BiCMOS technology with an $f_T$ and $f_{\max}$ of 500 GHz and 610 GHz, respectively.

2606.19453 2026-06-19 eess.AS 新提交

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

全双工口语对话系统综述：架构层次、交互本体与决策状态机

Jingyu Lu, Yuhan Wang, Jianming Luo, Yifu Chen, Tianle Liang, Shengpeng Ji, Ziyue Jiang, Xiaoda Yang, Yu Zhang, Xize Cheng, Chenyuhao Wen, Changhao Pan, Haoxiao Wang, Chen Ye, Jian Wu, Xiaoxi Jiang, Guanjun Jiang, Zhou Zhao

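AI summary: To resolve ambiguity in the term "full-duplex", proposes three frameworks: an L0-L3 architectural hierarchy, a T×I×R interaction ontology, and an IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine; reveals a realization gap in how existing systems are trained and evaluated.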

Comments 34 pages, 5 figures, 7 tables. Project page and interactive demo: https://github.com/DuplexLM/DuplexSurvey

AI中文摘要

近期有十余个口语对话系统声称实现了“全双工”，但该术语被用于描述本质上不同的能力。现有综述将它们归入单一轴（级联/端到端，或工程化/学习型），忽略了构建者最关心的区别。我们认为这种歧义很大程度上源于分类学问题：当前术语未明确双工决策在何处做出、支持哪些交互类型、以及系统如何逐时刻行为。本文引入三个互补框架：(i) L0-L3架构层次，定位双工决策位置；(ii) T×I×R交互本体，指定每次交互的时间关系、用户意图和所需系统响应；(iii) 决策状态机（IDLE/LISTEN/SPEAK/WAIT/DUAL），描述系统如何在状态间转换。通过对已发表系统和基准的审计，我们记录了一个实现差距：尽管许多架构原则上能在全双工状态下运行，但其观察到的行为仍受训练和评估中表示的交互模式约束。我们指出，相对于（大多未公开的）工业语料库，有限的公开训练数据覆盖范围，以及尚未实现的L3表示级建模目标，是全双工对话未来研究的关键前沿。相关材料见https://github.com/DuplexLM/DuplexSurvey。

英文摘要

More than a dozen spoken dialogue systems have recently claimed to be "full-duplex," yet the term has been used to describe substantially different capabilities. Existing surveys collapse them onto a single axis (cascaded/end-to-end, or engineered/learned) and miss the distinctions that matter most for builders. We argue that much of this ambiguity is taxonomical: current terminology does not specify where duplex decisions are made, which interaction types are supported, or how a system behaves moment by moment. This paper introduces three complementary frameworks: (i) an L0-L3 Architectural Hierarchy that locates where duplex decisions are made; (ii) a $T\times I\times R$ Interaction Ontology that specifies the temporal relation, user intent, and required system response for each interaction; and (iii) a Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) that describes how systems move between states. Across published systems and benchmarks, our audit documents a realization gap: although many architectures can in principle operate in full-duplex states, their observed behavior remains constrained by the interaction patterns represented in training and evaluation. We point to the limited public training-data coverage relative to the (largely undisclosed) industrial corpora, together with the still-unrealized goal of L3 representation-level modeling, as the key frontiers for future research on full-duplex dialogue. The related material is available at https://github.com/DuplexLM/DuplexSurvey.

2606.20457 2026-06-19 eess.AS cs.AI cs.LG 新提交

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

重新利用语音分类器进行基于引导扩散的语音生成

Rostislav Makarov, Timo Gerkmann

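AI summary: Proposes using a pretrained speech classifier as the backbone for diffusion generation by attaching a lightweight subnetwork and training only that subnetwork, achieving high-quality conditional speech generation with a single-backbone model at reduced memory and computational cost.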

Comments Accepted for publication in the Proceedings of Interspeech 2026

AI中文摘要

分类器引导是一种通过使用噪声条件分类器将采样过程导向目标类别来控制扩散生成的方法。分类器引导的一个缺点是需要两个单独训练的模型：一个分类器和一个扩散模型。因此，我们研究了一种更紧凑的替代方案，其中将传统训练的语音分类器重新用作扩散生成的主干。从log-Mel空间中的冻结噪声条件分类器开始，我们附加一个轻量子网络，该子网络重用中间分类器表示，并在去噪分数匹配目标下仅训练该子网络。我们的工作表明，预训练的分类器可以重新用于条件生成，为判别建模和条件语音合成之间提供了有吸引力的桥梁，从而在单主干模型中实现高语音质量，同时减少内存占用和计算成本。

英文摘要

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis, resulting in high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
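
For reference, the standard classifier-guidance identity that the abstract builds on follows from Bayes' rule. The notation below (noisy sample $x_t$, class label $y$, guidance scale $w$) is assumed here for illustration and is not taken from the paper:

```latex
% Bayes' rule: the term \log p_t(y) does not depend on x_t, so its gradient vanishes.
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)

% Guided score with an adjustable scale w applied to the classifier term:
\tilde{s}(x_t, y)
  = \nabla_{x_t} \log p_t(x_t) + w \, \nabla_{x_t} \log p_t(y \mid x_t)
```

The first term is normally supplied by a separately trained diffusion model and the second by the noise-conditioned classifier, which is the two-model requirement the abstract describes as a drawback.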

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 新提交

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA：针对重音错误的合成语音训练的以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

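AI summary: Proposes the PASQA model, which uses an accent-controllable synthetic dataset and pseudo accent-quality scores together with self-supervised representations, mora-conditioned fusion, and other training strategies to assess pitch-accent correctness effectively, outperforming conventional MOS models.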

Comments Accepted to INTERSPEECH 2026

AI中文摘要

现有的平均意见得分（MOS）预测模型通常预测话语级别的自然度MOS，并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估（PASQA），明确针对音高重音正确性。为了训练我们的模型，我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集，并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上，并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明，传统模型无法保持按重音错误严重程度的排序，而PASQA在已见和未见说话者上都实现了高排序准确性。此外，PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取：https://github.com/lycorp-jp/PASQA。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2606.20106 2026-06-19 eess.AS cs.SD 新提交

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

利用文本无关说话人验证的用户自定义关键词个性化唤醒

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

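AI summary: Proposes ZP-KWS, a lightweight framework that combines a phoneme-supervised audio encoder with a compact speaker encoder and uses multiplicative late fusion for zero-shot keyword detection with speaker verification, reducing the target false rejection rate by up to 60% on multiple datasets.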

Comments Accepted to Interspeech 2026

2606.20074 2026-06-19 eess.SP cs.AI cs.LG 新提交

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

用于ICU中基于事件的爆发-抑制检测的EEG基础模型评估

Elisa Vasta, Thorir Mar Ingolfsson, Andrea Cossettini, Luca Benini, Tilman Beck, Emanuela Keller, Una Pale

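AI summary: This study is the first to evaluate EEG foundation models for burst detection in the ICU without patient-specific calibration; the REVE-base model reaches an event-level F1 score of 0.868 and reduces burst-per-minute error by 52.1% and 36.2% relative to EEGNet and adaptive thresholding, respectively.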

Comments 4 pages, 1 figure. Code available upon publication

AI中文摘要

爆发抑制（BS）是一种临床相关的脑电图（EEG）模式，用于监测危重患者的镇静深度和脑活动，特别是在重症监护病房（ICU）的诱导昏迷期间。自动爆发检测仍然具有挑战性，因为BS模式在不同患者之间差异很大，且标注数据集稀缺。最近，EEG基础模型（FMs）在多个下游EEG应用中显示出前景，但它们在BS检测中的实用性尚未被探索。我们提出了第一项研究，评估EEG FMs在减少导联的ICU EEG中无需患者校准的爆发检测性能。我们将REVE-base、LUNA-large和LuMamba-Tiny与自适应阈值基线以及任务特定的EEGNet基线进行比较。此外，我们补充了基于事件的爆发检测评估，以替代传统的EEG窗口分类。这有助于临床评估爆发事件是否被正确检测，减少预期标注变异性的影响。最佳模型REVE-base取得了最高的事件级F1分数（$0.868 \pm 0.167$），并且与EEGNet和自适应阈值相比，分别将每分钟爆发错误减少了52.1%和36.2%，支持了FMs在ICU中可扩展的EEG监测。消融实验表明，与冻结骨干训练、两步微调和基于LoRA的适应相比，全微调是最有效的适应策略，对于LUNA-large，事件级F1分数比冻结骨干训练提高了最多$+0.102$。在减少标注数据集的情况下，预训练的REVE-base在25%的队列中比随机初始化高出$+0.723$事件级F1点，证明了在有限标注数据下适应爆发检测时预训练FM表示的优势。

英文摘要

Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assess clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding, respectively, supporting FMs for scalable EEG monitoring in the ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy compared with frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretraining FM representations when adapted to burst detection with limited labeled data.
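
Event-based evaluation scores whole burst episodes rather than individual windows. The abstract does not specify the matching rule, so the sketch below assumes the simplest one: an event counts as detected if it overlaps any event in the other set. The interval format `(start, end)` in seconds and the function names are illustrative assumptions, not the paper's protocol.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def event_f1(reference, predicted):
    """Event-level F1 under an any-overlap matching rule (assumed for illustration)."""
    # Recall: fraction of annotated bursts touched by at least one prediction.
    matched_ref = sum(any(overlaps(r, p) for p in predicted) for r in reference)
    # Precision: fraction of predicted bursts touching at least one annotation.
    matched_pred = sum(any(overlaps(p, r) for r in reference) for p in predicted)
    recall = matched_ref / len(reference) if reference else 0.0
    precision = matched_pred / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A rule like this tolerates small disagreements about burst onset and offset, which is one way to reduce the influence of annotation variability that the abstract mentions.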


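2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD New submission

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

AI summary: Using perturbations of acoustic quality, prosody, and speaker characteristics, the study finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they are biased by mean fundamental frequency (F0) while insensitive to speaking rate and F0 variability.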

Comments Accepted to INTERSPEECH 2026

English abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.


2606.19943 2026-06-19 eess.IV cs.AI New submission

SIMBA: A Bidirectional Retrieval Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications

Jingdong Shen, Fu Wang*, Qifeng Lu, Hao Huang, Chunqiang Wu, Chi Yang, Xiaofang Liu

AI summary: Proposes the SIMBA framework, which jointly performs atmospheric profile retrieval and radiance reconstruction, strengthens their coupling with a cycle-consistency constraint and a bidirectional Mamba module, and outperforms several deep learning baselines on FY-4A GIIRS data.

English abstract

Hyperspectral infrared observations are an important data source for numerical weather prediction (NWP) because they provide rich information on the vertical structure of atmospheric temperature and humidity. However, most existing deep learning methods mainly focus on one-way retrieval from radiances to atmospheric profiles, while the reverse radiance simulation process and the consistency between atmospheric state space and radiance observation space are insufficiently considered. In this study, we propose SIMBA, a unified bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling toward NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, introduces a cycle-consistency constraint to strengthen the coupling between the two processes, and employs a bidirectional Mamba state-space module to capture long-range dependencies along pressure levels. Using collocated FY-4A GIIRS observations and ERA5 reanalysis data, the proposed method is evaluated for temperature retrieval, specific humidity retrieval, long-wave radiance reconstruction, and medium-wave radiance reconstruction. Experimental results show that SIMBA outperforms several representative deep learning baselines across both retrieval and reconstruction tasks, while ablation experiments confirm the contribution of the bidirectional design and cycle-consistency mechanism. These results demonstrate that the proposed framework is effective for joint atmospheric profile retrieval and hyperspectral infrared radiance modeling, and suggest potential for future Jacobian-related analysis and NWP-oriented extensions.
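
The cycle-consistency idea in the abstract can be sketched as a loss that ties a retrieval map (radiance to profile) to a forward map (profile to radiance). The snippet below is a generic illustration under stated assumptions: `retrieve` and `simulate` are hypothetical callables, the squared-error terms and the `weight` parameter are choices made for this sketch, and none of it is taken from the SIMBA implementation.

```python
import numpy as np


def cycle_consistency_loss(radiance, profile, retrieve, simulate, weight=1.0):
    """Supervised losses for both directions plus a cycle term (illustrative)."""
    pred_profile = retrieve(radiance)      # radiance -> profile
    recon_radiance = simulate(profile)     # profile -> radiance
    retrieval_loss = np.mean((pred_profile - profile) ** 2)
    forward_loss = np.mean((recon_radiance - radiance) ** 2)
    # Cycle terms: going through both maps should return the starting point.
    cycle_rad = simulate(pred_profile)     # radiance -> profile -> radiance
    cycle_prof = retrieve(recon_radiance)  # profile -> radiance -> profile
    cycle_loss = (np.mean((cycle_rad - radiance) ** 2)
                  + np.mean((cycle_prof - profile) ** 2))
    return retrieval_loss + forward_loss + weight * cycle_loss
```

With a toy invertible linear forward map and its exact inverse as the retrieval, every term vanishes. A retrieval that is inconsistent with the forward map is penalized by the cycle terms on top of the supervised terms.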


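2606.19823 2026-06-19 eess.AS cs.LG New submission

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

AI summary: To address the scarcity and high variability of dysarthric speech data, the paper uses zero-shot voice cloning (Higgs Audio V2) to generate synthetic data for fine-tuning Whisper-medium, reaching a word error rate on TORGO close to that of real-data fine-tuning while significantly lowering data collection cost.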

Comments Accepted to Interspeech 2026, Sydney, Australia

English abstract

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot (31.62%), Clone FT achieved a competitive 26.00% WER, nearly matching the 24.44% and 25.12% seen with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative) in cross-corpus evaluation on the SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
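
For readers less familiar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a plain dynamic-programming version written for this note, not the evaluation code used in the paper, and the helper `relative_reduction` is a hypothetical name. Applied to the reported figures, going from 31.62% to 26.00% WER is a relative reduction of about 17.8% (this is arithmetic on the abstract's numbers, not a figure quoted from it).

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline - improved) / baseline
```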


2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP New submission

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: To address data scarcity and varying severity in dysarthric speech recognition, the paper explores four data augmentation methods (SRM, PM, FM, VTLP) for fine-tuning a pre-trained Wav2Vec2 model, achieving notable word error rate reductions across severity levels.

English abstract

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the End-to-End pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s = 0.8$) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau = 0.8$) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.


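2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP New submission

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: The paper systematically studies combinations of spectral features and acoustic models; by adding pitch features and tuning the number of overlapping training frames, it achieves relative improvements of 4.65% and 4.63% in isolated-word and sentence recognition with an F-TDNN model.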

English abstract

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.


2606.19791 2026-06-19 eess.AS cs.AI cs.SD New submission

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI summary: For low-resource children's speech recognition, the paper systematically analyzes how different fine-tuning strategies generalize across datasets, ages, and genders, and finds that specific strategies significantly improve generalization.



2606.19574 2026-06-19 eess.IV cs.CV New submission

FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

Chengwei Zhou, Ovishake Sen, Xuming Chen, Rishith Paramasivam, Shaahin Angizi, Swarup Bhunia, Baibhab Chatterjee, Gourav Datta

AI summary: Proposes FrequencyFormer, which compresses images into frequency-domain tokens through multi-scale DCT tokenization and combines near-sensor LUT hardware with a low-power communication architecture, achieving up to 128x data compression and 28.8 TOPS/W energy efficiency while remaining compatible with multiple vision tasks.

English abstract

Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
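
To make the idea of frequency-domain tokens concrete, here is a simplified single-scale sketch: each patch of a grayscale image is transformed with a 2D DCT and only the lowest-frequency coefficients are kept as the token. The patch size, the number of kept coefficients, and the function names are illustrative assumptions. The paper's tokenizer is multi-scale and implemented with LUT hardware, neither of which this sketch attempts to reproduce.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return mat


def dct_tokens(image, patch=16, keep=4):
    """Turn a 2D image into one token per patch.

    Each token holds the keep x keep lowest-frequency 2D DCT coefficients
    of its patch, flattened to a vector.
    """
    h, w = image.shape
    d = dct_matrix(patch)
    tokens = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            block = image[r:r + patch, c:c + patch]
            coeffs = d @ block @ d.T  # separable 2D DCT
            tokens.append(coeffs[:keep, :keep].ravel())
    return np.stack(tokens)
```

With these toy settings a 224x224 image becomes 196 tokens of 16 values each, a 16x reduction in the number of values. This is only meant to show where the compression comes from and says nothing about how the reported 128x figure is reached.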


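2606.19372 2026-06-19 eess.IV cs.CV cs.LG New submission

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

Jonathan Thomas, Harsh Thaker

AI summary: Proposes the Full-Self Diagnostics (FSD) framework, which combines a physics-based forward model, information-theoretic observability, a regularized inverse problem, operator learning, and stochastic variational inference to recover physiological state from 9-second facial videos; validated on 38,812 scans from 59 subjects, with a glucose MARD of 29.86% on the lead author's self-collected data.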

Comments 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

English abstract

We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
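
The headline metric, MARD (mean absolute relative difference), is the average of |predicted - reference| / reference, expressed as a percentage. The sketch below shows that calculation together with the scaling the abstract describes, where error shrinks in proportion to one over the square root of the number of paired observations. Both function names are hypothetical and the scaling helper simply restates that claimed rate. Neither is taken from the authors' code.

```python
import math


def mard(reference, predicted):
    """Mean absolute relative difference, in percent."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / r for r, p in zip(reference, predicted))
    return 100.0 * total / len(reference)


def error_scale(n_old, n_new):
    """Expected error ratio when the number of pairs grows from n_old to
    n_new, assuming error is proportional to 1 / sqrt(n)."""
    return math.sqrt(n_old / n_new)
```

For example, reference glucose values of 100 and 200 mg/dL with predictions of 110 and 180 mg/dL give a MARD of 10%, and under the stated rate, quadrupling the number of paired scans would halve the expected error.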


2606.20401 2026-06-19 eess.SY cs.SY New submission

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

Qian Zhang, Andrea Pomarico, Costas Mylonas, Magda Foti, Alberto Berizzi, Le Xie

AI summary: Proposes the PowerAgentBench-Dyn benchmark for evaluating LLM-based agents on power system dynamic-analysis tasks, covering two tasks: model quality review and security risk screening.

English abstract

Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering workflows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require a type of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.


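2606.20361 2026-06-19 eess.SY cs.SY New submission

Sparse add-on controller design: A Youla approach to system-level performance

M. van der Hulst, N. Dirkx, R. A. González, K. Tiels, J. van de Wijdeven, T. Oomen

AI summary: Proposes a sparse add-on controller design framework based on the Youla parameterization, which solves a sparse H2 synthesis problem by convex optimization to achieve an optimal trade-off between system-level performance and interconnection complexity.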

2606.20301 2026-06-19 eess.SY cs.SY New submission

Data-Driven Control from Poisoned Data: Fundamental Limitations and Secure DeePC

Takumi Shinohara, Henrik Sandberg, Karl Henrik Johansson

AI summary: For arbitrary data poisoning attacks, proposes the Secure DeePC algorithm, which uses output truncation and online reconstruction to achieve MPC-equivalent performance in finite time.

English abstract

We study a data-driven control problem in the presence of arbitrary data poisoning attacks. We assume that a subset of offline output data is stored in unprotected locations and may be poisoned by an adversary. We first establish fundamental limitations for data-driven control arising from such poisoned data: poisoning attacks are not detected/identified from the dataset alone; unprotected data are non-informative for controller design with worst-case guarantees; and hard constraints on unprotected outputs are not certifiable. Motivated by these limitations and the data-enabled predictive control (DeePC) technique, we propose Secure DeePC, a data-driven control algorithm that is resilient against poisoning attacks. It first runs output-truncated DeePC using only the protected dataset until the online input becomes persistently exciting. It then uses online measurements to reconstruct the partial offline dataset, and finally returns to full-output DeePC. Secure DeePC achieves MPC-equivalent performance in finite time almost surely under certain conditions. Simulation results illustrate the efficacy of the proposed framework against poisoning attacks.
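
The switching condition mentioned above, waiting until the online input becomes persistently exciting, is usually checked through the rank of a Hankel matrix built from the input sequence. The sketch below shows that standard check: an input is persistently exciting of order L when its block Hankel matrix with L block rows has full row rank. The function names are chosen for this note, and the sketch illustrates only the generic condition, not the Secure DeePC algorithm itself.

```python
import numpy as np


def hankel(signal, depth):
    """Block Hankel matrix with `depth` block rows from a (T, m) signal.

    A 1D sequence of length T is treated as a single-channel (T, 1) signal.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim == 1:
        signal = signal[:, None]
    n_samples = signal.shape[0]
    cols = n_samples - depth + 1
    if cols < 1:
        raise ValueError("signal is too short for the requested depth")
    # Block row i contains samples i, i+1, ..., i+cols-1 as columns.
    return np.vstack([signal[i:i + cols].T for i in range(depth)])


def is_persistently_exciting(inputs, order):
    """True if the depth-`order` Hankel matrix of the inputs has full row rank."""
    h = hankel(inputs, order)
    return np.linalg.matrix_rank(h) == h.shape[0]
```

A random input sequence passes this test for moderate orders, whereas a constant input fails for any order above one because all rows of its Hankel matrix are identical.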
