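arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.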

2606.11135 2026-06-10 eess.SP 新提交

Pre-Fault Voltage Discrimination and Time-Domain Protection for Distribution Networks with Inverter-Based Resources

含逆变器资源的配电网故障前电压判别与时域保护

Junyuan Zhao, François Bouffard, Géza Joós

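AI summary: To address the failure of conventional overcurrent protection caused by inverter-based resources, the paper proposes a pre-fault voltage discrimination strategy combined with time-domain protection principles for fast and reliable fault detection.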

AI中文摘要

配电网中逆变器资源（IBRs）的日益普及给基于相量的过流保护带来了重大挑战。这一挑战源于IBRs缺乏短路电流供给能力。因此，传统的过流保护功能（例如ANSI 51）在此类场景中不足，需要替代方法。例如，时域保护有望克服这一挑战。本文提出了一种故障前电压判别（PVD）策略，其作用是检测故障并将正常开关和变压器励磁涌流扰动与实际故障区分开。PVD的使用允许通过使用含IBRs配电网的时域保护原理，设计一种简单而有效的故障检测算法。PVD的引入提供了更快的故障检测，同时不降低安全性和可靠性。离线仿真实验和控制器硬件在环实时仿真验证了所提算法在各种故障和正常开关事件中的有效性。

英文摘要

The increasing proliferation of inverter-based resources (IBRs) in distribution networks is presenting a major challenge for phasor-based overcurrent protection. This challenge stems from IBRs' lack of short-circuit current sourcing capacity. As a result, traditional overcurrent protection functions (e.g., ANSI 51) are inadequate in such scenarios, and warrant alternative approaches. Time-domain protection, for example, shows promise in overcoming this challenge. In this paper we propose a pre-fault voltage discrimination (PVD) strategy whose role is to detect faults and discriminate normal switching and transformer inrush disturbances from actual faults. The use of PVD allows for the design of a simple, yet effective fault detection algorithm by using time-domain protection principles for distribution networks containing IBRs. The introduction of PVD provides for faster fault detection without reducing security and dependability. Offline simulation experiments and controller hardware-in-the-loop real-time simulation validate the effectiveness of the proposed algorithm against various fault and normal switching events.

2606.10900 2026-06-10 eess.SP 新提交

Personalized Deep Learning for Short-Term Forecasting of Impending Atrial Fibrillation from Continuous Wearable ECG Signals

基于个性化深度学习的连续可穿戴心电图信号短期房颤预测

Jangwon Suh, Soonil Kwon, Jungmin Ko, Yun Kwan Kim, Hee Seok Song, Eue-Keun Choi, Wonjong Rhee

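AI summary: To address inter-individual variability in forecasting atrial fibrillation from wearable ECG, the paper personalizes a global model through fine-tuning, significantly improving performance on three datasets and revealing clinically relevant precursor features such as heart rate and RMSSD.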

Comments: Code is available at https://github.com/SNU-DRL/Personalized-AF-Forecasting

AI中文摘要

背景与目的：连续可穿戴心电图监测越来越多地用于动态心律失常监测，然而预测即将发生的房颤面临患者间心电图变异的挑战。本研究探讨了通过基于个体心电图信号的微调来个性化全局模型是否能改善即将发生房颤的短期预测。方法：在ICENTIA11K数据集上训练的全局模型与在三个队列（ICENTIA11K、IRIDIA-AF和MobiCARE）上微调的个性化模型进行了比较。预处理后，模型处理60秒的心电图片段，预测未来五分钟。我们评估了适应数据量的影响，并分析了心电图特征，如心率和RMSSD。结果：个性化模型显著优于全局模型，在ICENTIA11K中AUROC为0.711 vs. 0.614，在MobiCARE中为0.686 vs. 0.585。个性化收益随着患者特定微调数据量的增加而增加。虽然全局模型的准确性随着房颤发作的临近而提高，但两个外部队列中的个性化模型表现出不同的时间动态，这可能表明捕获了患者特定特征，这些特征较少依赖于房颤事件的临近性。房颤前发作显示心率和RMSSD升高。特征归因突出了临床相关的前兆，包括频繁的房性早搏和短阵室上性心动过速。结论：使用患者特定的可穿戴心电图数据自适应深度学习模型显著增强了即将发生房颤的短期预测。这种个性化框架支持及时的预防性干预，并改善动态监测环境中的房颤管理。

英文摘要

Background and Objective: Continuous wearable electrocardiogram (ECG) monitoring is increasingly used for ambulatory arrhythmia surveillance, yet forecasting impending atrial fibrillation (AF) is challenged by inter-patient ECG variability. This study investigated whether personalizing a global model via fine-tuning on an individual's ECG signals improves short-term forecasting of impending AF. Methods: A global model trained on the ICENTIA11K dataset was compared against personalized models fine-tuned across three cohorts: ICENTIA11K, IRIDIA-AF, and MobiCARE. Following preprocessing, models processed 60-second ECG segments for a five-minute forecast horizon. We evaluated the impact of adaptation data volume and analyzed ECG features, such as heart rate and RMSSD. Results: Personalized models significantly outperformed the global model, achieving AUROCs of 0.711 vs. 0.614 in ICENTIA11K and 0.686 vs. 0.585 in MobiCARE. Personalization benefits increased with the amount of patient-specific fine-tuning data. While the global model's accuracy rose as AF onset approached, personalized models in the two external cohorts exhibited distinct temporal dynamics, which may indicate the capture of patient-specific characteristics less dependent on proximity to the AF event. Pre-AF episodes showed elevated heart rates and RMSSD. Feature attributions highlighted clinically relevant precursors, including frequent premature atrial complexes (PACs) and short supraventricular tachycardias (SVTs). Conclusions: Adapting deep learning models with patient-specific wearable ECG data significantly enhances short-term forecasting of impending AF. This personalized framework supports timely preventive interventions and improved AF management in ambulatory monitoring environments.
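
The heart rate and RMSSD features mentioned in the abstract can be computed from beat-to-beat (RR) intervals. The sketch below is a minimal illustration, assuming the RR intervals have already been extracted from the ECG and are given in milliseconds; the function names are hypothetical and are not taken from the authors' repository.

```python
import math


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals (ms)."""
    if len(rr_intervals_ms) < 2:
        raise ValueError("need at least two RR intervals")
    # Differences between each interval and the one that follows it.
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate_bpm(rr_intervals_ms):
    """Mean heart rate in beats per minute from RR intervals (ms)."""
    mean_rr = sum(rr_intervals_ms) / len(rr_intervals_ms)
    return 60000.0 / mean_rr
```

For example, `rmssd([800, 810, 790, 800])` averages the squared successive differences (10, -20, 10) before taking the square root, so larger beat-to-beat swings raise the value.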

2606.10869 2026-06-10 eess.SP 新提交

Information Bottleneck Meets Quantization: Finite Rate Analysis and Optimal Designs

信息瓶颈遇上量化：有限速率分析与最优设计

Francesco Binucci, Paolo Banelli

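AI summary: The paper theoretically analyzes how scalar and vector quantization of the Gaussian Information Bottleneck (GIB) latent representation affects informativeness about the target data, proposes task-oriented quantization designs under a finite-rate constraint, validates them on MMSE regression problems, and finally extends the task-oriented idea to non-Gaussian settings.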

Comments: 16 pages, 9 figures

AI中文摘要

信息瓶颈（IB）是一个成熟的框架，通过权衡速率和数据表示大小，寻找数据源的潜在紧凑表示，以获得相对于另一个目标数据的信息准确性。当目标与源联合高斯时，高斯IB（GIB）是其简单的闭式解。然而，在许多实际问题中，潜在表示必须由有限数量的比特存储或表示，而最优（G）IB解则不然。首先，本文从理论上分析了标量和向量量化对GIB潜在表示的影响，以及其对目标数据（非）信息性的影响。然后，通过在潜在表示上施加有限速率约束，重新表述GIB优化问题，提出了任务导向的量化设计。在MMSE回归问题上的仿真结果证实了所提出的量化设计的有效性，与标准GIB潜在表示的更启发式或分离的量化设计相比，显示出显著的增益。最后，通过适当修改用于IB启发的向量量化器的变分自编码器（VAE）中的代价函数，将任务导向思想扩展到非高斯设置。

英文摘要

The Information Bottleneck (IB) is a well-established framework that seeks a compact latent representation of a data source by trading rate and representation size against information accuracy with respect to other target data. The Gaussian IB (GIB) is its simple closed-form solution when the target is jointly Gaussian with the source. In many practical problems, however, the latent representation has to be stored or represented with a finite number of bits, a constraint that the optimal (G)IB solution does not satisfy. First, this manuscript theoretically analyzes the effect of scalar and vector quantization of the GIB latent representation, and its impact on the (dis)informativeness with respect to the target data. Then, task-oriented quantization designs are proposed by (jointly) reformulating the GIB optimization problem under a finite-rate constraint on the latent representation. Simulation results on MMSE regression problems confirm the effectiveness of the proposed quantization designs, which show significant gains with respect to more heuristic, or separate, quantization designs of the standard GIB latent representation. Finally, the paper extends the task-oriented philosophy to non-Gaussian settings, by properly modifying the cost function used in variational auto-encoders (VAEs) of IB-inspired vector quantizers.

2606.10864 2026-06-10 eess.AS 新提交

Phoneme-First Prediction for LLM-Based Speech Recognition

基于LLM的语音识别的音素优先预测

Jakob Poncelet, Hugo Van hamme

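AI summary: The paper integrates a phoneme prediction step into the LLM, predicting phonemes before generating the transcript, to improve speech recognition accuracy and explainability in low-resource settings.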

Comments: Accepted at EUSIPCO 2026

AI中文摘要

近期研究探索了将大型语言模型（LLM）与语音编码器集成，以创建能够进行上下文感知语音识别的语音增强型LLM。主要挑战在于将LLM的语义嵌入与语音编码器的声学表示对齐。我们提出了一种新颖的方法，教导LLM首先从语音特征中预测音素，然后再生成最终转录。通过将音素预测步骤直接集成到LLM中，模型能够获得细粒度的发音知识，减少声学混淆，提高转录准确性和可解释性。我们的方法廉价且简单，因为音素目标可以从现有转录中自动推导。通过全面的实验，我们表明中间音素预测可以改善语音识别，特别是在低资源设置下，并且产生的输出在声学上更忠实于语音。

英文摘要

Recent research has explored integrating Large Language Models (LLMs) with speech encoders to create speech-augmented LLMs capable of contextualized speech recognition. The main challenge lies in aligning the semantic embeddings of LLMs with the acoustic representations of speech encoders. We propose a novel approach that teaches the LLM to first predict phonemes from the speech features before generating the final transcript. By integrating a phoneme prediction step directly into the LLM, the model develops a fine-grained knowledge of pronunciation, reducing acoustic confusion and improving transcription accuracy and explainability. Our method is cheap and simple, as phoneme targets can be automatically derived from existing transcripts. Through comprehensive experiments, we show that intermediate phoneme prediction can improve speech recognition, particularly in low-resource settings, and yields outputs that are acoustically more faithful to the speech.

2606.10853 2026-06-10 eess.AS 新提交

Speech Encoder Fusion for LLM-based Automatic Speech Recognition

面向基于LLM的自动语音识别的语音编码器融合

Jakob Poncelet, Hugo Van hamme

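AI summary: The paper studies fusing multiple pre-trained speech encoders to improve LLM-based ASR, proposing several fusion strategies and validating them across multiple scenarios.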

Comments: Accepted at Interspeech 2026

AI中文摘要

语音感知的大语言模型（LLMs）可以通过预训练的声学编码器将语音特征投影到LLM嵌入空间中来整合语音。虽然语音编码器的选择对性能有重要影响，但不同的编码器通常表现出互补的优势，这激发了它们的组合。在这项工作中，我们研究了融合多个预训练语音编码器是否能增强用于自动语音识别（ASR）的语音感知LLMs。我们探索了多种超越简单特征拼接的融合策略，包括学习组合和基于Transformer的融合架构，并在单语和多语ASR设置以及带说话人日志的语音识别中进行了评估。我们的结果表明，仔细融合多个并行语音编码器能在所有场景中提升下游性能，且计算开销有限。

英文摘要

Speech-aware large language models (LLMs) can incorporate speech through pre-trained acoustic encoders that project speech features into the LLM embedding space. While the choice of the speech encoder critically influences performance, different encoders often exhibit complementary strengths, motivating their combination. In this work, we investigate whether fusing multiple pre-trained speech encoders can enhance speech-aware LLMs for automatic speech recognition (ASR). We explore several fusion strategies beyond simple feature concatenation, including learned combinations and Transformer-based fusion architectures, and evaluate them across mono- and multilingual ASR settings as well as diarized speech recognition. Our results indicate that carefully fusing multiple parallel speech encoders improves downstream performance in all scenarios with limited computational overhead.

2606.10838 2026-06-10 eess.AS 新提交

Towards Deep Contextual Reasoning from Broad Descriptions for ASR with Speech-LLM via Metadata-Driven Reasoning Chains

面向语音-大语言模型的基于元数据驱动推理链的宽描述深度上下文推理

Jakob Poncelet, Hugo Van hamme

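AI summary: The paper proposes a training method that lets a speech-LLM use broad descriptions as weak semantic priors and apply chain-of-thought reasoning for context-driven correction, reducing errors on rare words and named entities.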

Comments: Accepted at Interspeech 2026

AI中文摘要

语音识别在罕见领域特定术语和上下文相关的命名实体上常常失败。现有的上下文化技术通常使用关键词或短语列表来偏置解码，这难以扩展或利用更深层次的知识。我们提出一种训练方法，教会语音-LLM使用宽描述（例如来自视频的描述）作为弱语义先验，以执行基于音频的上下文推理。我们通过将错误假设与视频元数据和LLM生成的推理解释配对，构建了400小时的推理增强语音数据，这些解释证明了上下文驱动的修正。我们微调语音-LLM以执行思维链推理：生成初始转录，然后对上下文进行推理，最后返回修正后的转录。在保留的YouTube测试集上，我们的方法减少了错误，特别是在罕见词和命名实体上有所改进，并为语音识别中更深层次的上下文推理奠定了基础。

英文摘要

Speech recognition often fails on rare, domain-specific terms and context-related named entities. Existing contextualization techniques typically bias decoding with keywords or phrase lists, which does not scale well or exploit deeper knowledge. We propose a training method that teaches a speech-LLM to use broad descriptions (e.g. from videos) as weak semantic priors to perform contextual reasoning grounded in the audio. We build 400 hours of reasoning-augmented speech data by pairing erroneous hypotheses with video metadata and LLM-generated reasoning explanations that justify context-driven corrections. We finetune the speech-LLM to perform chain-of-thought reasoning: generate an initial transcript, then reason over the context, and finally return a corrected transcript. On held-out YouTube-derived test sets, our approach reduces errors, with specific improvements on rare words and named entities, and lays groundwork for deeper contextual reasoning in speech recognition.

2606.10758 2026-06-10 eess.AS 新提交

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

锚定未知：基于代理-锚点学习的开放集模型归因

Cristian-Teodor Neamtu, Serban Mihalache, Stefan Smeu, Dan Oneata, Horia Cucu, Dragos Burileanu

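AI summary: The paper proposes a metric learning framework based on the Proxy-Anchor loss that uses Wav2Vec2-BERT embeddings for TTS source attribution and unknown-system detection. On the MLAAD v9 dataset (140 TTS systems), it reaches 99.76% accuracy on 110 in-distribution classes and an FPR@95 as low as 2.04% for OOD detection.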

Comments: Accepted to the 34th European Signal Processing Conference (EUSIPCO 2026)

AI中文摘要

能够生成逼真合成语音的文本到语音（TTS）系统的激增给音频取证带来了日益严峻的挑战。虽然二元深度伪造检测已受到广泛关注，但源追踪（即识别哪个TTS系统产生了给定的音频样本）仍未被充分探索，尤其是在可能遇到未知系统的开放集场景中。我们提出了一种基于代理-锚点损失函数的度量学习框架，该框架在Wav2Vec2-BERT嵌入上操作，以学习用于TTS源归因和未见系统分布外（OOD）检测的判别性嵌入空间。我们在涵盖51种语言、140个TTS系统的MLAAD v9数据集上进行了评估，并引入了一种架构合并策略，将TTS系统版本分组为统一类别，减少了类间混淆。我们的系统在110个分布内类别上达到了99.76%的准确率，OOD检测的假阳性率（FPR@95）低至2.04%。此外，为了与当前最先进的方法进行公平比较，我们进一步在MLAAD v5官方数据集划分上进行了评估，将OOD准确率提高了近一倍。这些结果表明，代理-锚点度量学习结合架构感知的类别设计和事后OOD评分，为闭集和开集场景下的取证TTS源追踪提供了一个有效的框架。

英文摘要

The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. For a fair comparison against the current state of the art, we further evaluate it on the official MLAAD v5 dataset splits, where it almost doubles the OOD accuracy. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.
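
FPR@95 is commonly read as the false positive rate at the threshold where 95% of in-distribution samples are still accepted. The sketch below illustrates that reading, assuming higher scores mean "more in-distribution"; the function name and the score convention are assumptions for illustration, not the paper's code.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when `tpr` of ID samples are accepted.

    Scores follow the (assumed) convention: higher = more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of in-distribution samples that must pass the threshold.
    k = max(1, math.ceil(tpr * len(ranked)))
    threshold = ranked[k - 1]
    # OOD samples scoring at or above the threshold are false positives.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

A lower value means fewer unseen systems are mistaken for known ones at the chosen operating point.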

2606.10540 2026-06-10 eess.SP 新提交

Complex VAE with Heavy-Tailed Likelihood for Radar Target Detection in Sea Clutter

基于重尾似然的复变分自编码器在海杂波中雷达目标检测

Ting Bai, Jun Tang, Yuxin Xu

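AI summary: To handle the heavy-tailed, spiky nature of sea clutter and the scarcity of target labels, the paper proposes an unsupervised complex-valued variational autoencoder that uses a Student-t negative log-likelihood to capture heavy-tailed reconstruction errors and adds a time-domain amplitude error constraint, enabling radar target detection at a constant false-alarm rate.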

AI中文摘要

为了解决海杂波的重尾、尖峰特性以及标记目标数据的稀缺性，提出了一种用于海上雷达目标检测的无监督复值变分自编码器（VAE）。在实现中，每个复基带慢时间序列由其同相和正交分量表示，模型学习仅从杂波数据中重构它们。采用Student-$t$负对数似然来捕获重尾重构误差，同时减少杂波学习期间对异常值的敏感性。此外，引入了时域幅度误差约束，以惩罚重构中的慢时间幅度失配。在推理时，重构偏差用作检测统计量，并通过从仅杂波验证集估计的经验分位数设置决策阈值，以实现恒虚警率（CFAR）。在实测海杂波数据上的实验表明，在CFAR约束下，检测性能相对于MF、AMF和实值$\beta$-VAE持续提升。

英文摘要

To address the heavy-tailed, spike-prone nature of sea clutter and the scarcity of labeled target data, an unsupervised complex-valued variational autoencoder (VAE) for maritime radar target detection is proposed. In implementation, each complex baseband slow-time sequence is represented by its in-phase and quadrature components, and the model learns their joint reconstruction from clutter-only data. A Student-$t$ negative log-likelihood is adopted to capture heavy-tailed reconstruction errors while reducing sensitivity to outliers during clutter learning. In addition, a time-domain amplitude error constraint is introduced to penalize slow-time magnitude mismatch in the reconstruction. At inference, reconstruction deviation is used as the detection statistic, and the decision threshold is set via an empirical quantile estimated from a clutter-only validation set to enforce a constant false-alarm rate (CFAR). Experiments on measured sea-clutter data show that detection performance is consistently improved over MF, AMF, and a real-valued $\beta$-VAE under CFAR constraints.
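
Two ingredients of the abstract lend themselves to a short illustration: a Student-t negative log-likelihood for a reconstruction residual, and a CFAR threshold taken as an empirical quantile of clutter-only scores. The sketch below assumes a scalar, zero-mean residual with fixed degrees of freedom and scale; the function names and default values are hypothetical, and the paper's actual loss and thresholding details may differ.

```python
import math


def student_t_nll(residual, nu=3.0, scale=1.0):
    """Negative log-density of a zero-mean Student-t residual."""
    z = residual / scale
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    # The log1p term grows only logarithmically, so outliers are penalized
    # far less than under a Gaussian (quadratic) loss.
    return -log_norm + 0.5 * (nu + 1.0) * math.log1p(z * z / nu)


def cfar_threshold(clutter_scores, pfa):
    """Empirical (1 - pfa) quantile of clutter-only detection scores."""
    ranked = sorted(clutter_scores)
    k = math.ceil((1.0 - pfa) * len(ranked))  # order statistic to use
    k = min(max(k, 1), len(ranked))
    return ranked[k - 1]


def detect(score, threshold):
    """Declare a target when the detection score exceeds the threshold."""
    return score > threshold
```

With this choice of order statistic, at most a fraction `pfa` of the validation clutter scores lie strictly above the threshold, which is what keeps the empirical false-alarm rate at or below the target.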

2606.10464 2026-06-10 eess.AS 新提交

GC-LoRA: Gated Convolutional LoRA for Parameter-Efficient Acoustic Adaptation

GC-LoRA：用于参数高效声学适应的门控卷积LoRA

Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan

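AI summary: The paper proposes the GC-LoRA adapter architecture, which injects Conformer-style local convolutional processing into pretrained Transformer encoders to capture local acoustic dependencies efficiently, achieving word error rate reductions of up to 10.9% across several acoustically mismatched domains.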

Comments: Accepted for publication at Interspeech 2026

AI中文摘要

基于Transformer的语音基础模型在大多数自动语音识别任务中表现出色，但在应用于声学特性不匹配的领域时，性能往往会下降。虽然参数高效微调（PEFT）方法（如低秩适应（LoRA））调整全局注意力，但它们缺乏对于捕捉领域特定变化至关重要的局部上下文建模。我们提出了GC-LoRA，一种新颖的适配器架构，将Conformer风格的局部卷积处理注入到预训练的Transformer编码器中。通过将轻量级适配器集成到编码器注意力输出投影中，我们的方法在不干扰预训练全局表示的情况下，高效地捕捉局部声学依赖。在多种数据集（声学退化、带限、方言、儿童语音）上的实验证明了我们方法的有效性，与基线相比，实现了高达10.9%的词错误率（WER）降低，同时仅增加少量可训练参数。

英文摘要

Transformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), adjust global attention, they lack the local context modeling crucial for capturing domain-specific variations. We propose GC-LoRA, a novel adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders. By integrating a lightweight adapter to encoder attention output projections, our method efficiently captures local acoustic dependencies without disrupting pretrained global representations. Experiments across diverse datasets (acoustically-degraded, bandlimited, dialectal, child) demonstrate the efficacy of our approach, achieving Word Error Rate (WER) reductions of up to 10.9% compared to baselines while adding minimal trainable parameters.
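
For context, the plain LoRA update that GC-LoRA builds on adds a scaled low-rank product to a frozen weight matrix. The sketch below shows only this standard LoRA forward pass in pure Python; the gated convolutional branch is not specified in the abstract and is deliberately left out, and all names here are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x) with W frozen.

    w: d_out x d_in frozen weights, a: r x d_in, b: d_out x r.
    """
    rank = len(a)
    base = matvec(w, x)                 # frozen pretrained projection
    update = matvec(b, matvec(a, x))    # trainable low-rank path
    return [y + (alpha / rank) * u for y, u in zip(base, update)]
```

Initializing `b` to zeros, as is usual for LoRA, leaves the pretrained output unchanged at the start of fine-tuning.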

2606.10240 2026-06-10 eess.IV 新提交

Laplace-Mixture Dipole Inversion for Quantitative Susceptibility Mapping

拉普拉斯混合偶极子反演用于定量磁化率成像

Shuai Huang, James J. Lah, Jason W. Allen, Deqiang Qiu

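AI summary: The paper proposes an automatic dipole inversion method based on a Laplace mixture prior (LAMDI) that preserves fine anatomical structures in quantitative susceptibility mapping without manual parameter tuning, with performance comparable to existing methods.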

AI中文摘要

目的：开发一种用于定量磁化率成像（QSM）的自动偶极子反演方法，在无需手动调整正则化参数的情况下保留精细解剖结构。理论：原始的带参数估计的近似消息传递（AMP-PE）框架使用单一拉普拉斯先验对图像梯度建模，未能充分捕捉脑磁化率图的重尾梯度分布。这种先验不匹配可能导致过度正则化和块状重建。我们通过使用双分量拉普拉斯混合先验对梯度建模来解决这一局限性。方法：我们提出一种拉普拉斯混合偶极子反演（LAMDI）方法，将双分量拉普拉斯混合先验融入具有自动参数估计的AMP-PE框架中。LAMDI在公开的体内数据集上进行了评估。其性能与FANSI、MEDI以及使用单一拉普拉斯先验的AMP-PE（AMP-PE-L1）在标准默认设置和参考调优设置下进行了比较。结果：在公开的多方向QSM数据集上，LAMDI实现了与AMP-PE-L1相当的NRMSE和SSIM，同时显著降低了HFEN，表明其更好地保留了高频解剖细节。在基于参考的调优下，FANSI和MEDI在某些指标上达到了最佳性能，但LAMDI在无需参考图或手动正则化调优的情况下仍具有竞争力。结论：LAMDI通过结合有竞争力的重建精度和改进的精细解剖细节保留，为QSM偶极子反演提供了一种有效且自动的参数估计替代方案。

英文摘要

Purpose: To develop an automatic dipole inversion method for quantitative susceptibility mapping (QSM) that preserves fine anatomical structures without the need for manual regularization-parameter tuning. Theory: The original approximate message passing with parameter estimation (AMP-PE) framework models image gradients with a single Laplace prior, which does not fully capture the heavy-tailed gradient distribution of brain susceptibility maps. This prior mismatch can lead to over-regularization and blocky reconstructions. We address this limitation by modeling the gradients with a two-component Laplace mixture prior. Methods: We propose a Laplace-Mixture Dipole Inversion (LAMDI) method by incorporating a two-component Laplace mixture prior into the AMP-PE framework with automatic parameter estimation. LAMDI was evaluated on a public in vivo dataset. Its performance was compared with FANSI, MEDI, and AMP-PE with a single-Laplace prior (AMP-PE-L1) under both standard default and reference-tuned settings. Results: On a public multi-orientation QSM dataset, LAMDI achieved NRMSE and SSIM comparable to AMP-PE-L1 while substantially reducing HFEN, suggesting improved preservation of high-frequency anatomical detail. Under reference-based tuning, FANSI and MEDI achieved the best performance for some metrics, but LAMDI remained competitive without requiring reference maps or manual regularization tuning. Conclusion: LAMDI provides an effective and automatic parameter-estimation alternative for QSM dipole inversion by combining competitive reconstruction accuracy with improved preservation of fine anatomical detail.
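
As a reading aid, a two-component Laplace mixture over an image-gradient coefficient $g$ can be written in the generic form below. The symbols $w$, $\lambda_1$, and $\lambda_2$ are notation assumed here for illustration; the paper's own parameterization may differ.

```latex
p(g) = w \,\frac{\lambda_1}{2}\, e^{-\lambda_1 |g|}
     + (1 - w)\,\frac{\lambda_2}{2}\, e^{-\lambda_2 |g|},
\qquad 0 \le w \le 1,\quad \lambda_1, \lambda_2 > 0 .
```

Setting $w = 1$ recovers the single Laplace prior. With $\lambda_1 > \lambda_2$, the first component concentrates mass near zero (smooth regions) while the second keeps heavier tails for large gradients (edges), which is one way such a mixture can match a heavy-tailed gradient distribution.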

2606.10190 2026-06-10 eess.SP 新提交

Optimal Illumination via Joint Movement and Phase Optimization for Movable Antenna-RIS Configuration

可移动天线-RIS配置的联合移动与相位优化的最优照明

Yan Zhang, Nicola Marchetti, Indrakshi Dey

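AI summary: The paper proposes a movable antenna-enhanced RIS architecture, models antenna movement with stochastic differential equations, and optimizes long-term SNR through a two-timescale framework, reaching a steady-state SNR of up to 36 dB and up to 16 times higher energy efficiency.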

AI中文摘要

可重构智能表面（RIS）能够实现对无线传播的可编程控制，但在静态部署中仍易受持续深度衰落的影响。本文引入了一种可移动天线增强的RIS（MA-RIS）架构，其中天线元件物理重新定位以采样独立的空间信道，从而实现移动性带来的分集。我们使用随机微分方程（SDE）框架对天线运动进行建模，该框架捕获了受控漂移和环境扩散。基于伊藤微积分的分析表征了稳态天线分布、空间去相关和中断概率，揭示了控制强度与移动随机性之间的基本权衡。为了在考虑控制开销的同时最大化长期信噪比，我们提出了一种开销感知的两时间尺度框架，将慢速天线轨迹控制与快速相位适应分离。通过汉密尔顿-雅可比-贝尔曼（HJB）公式的预测近似求解随机最优控制问题，实现了实时实现。仿真验证了理论预测：两时间尺度策略实现了高达36 dB的稳态信噪比，具有显著的稳定性，比仅位置控制高出15 dB，比未控制基线高出30 dB以上。尽管信噪比低于有源RIS，但所提出的方法在不同系统规模下实现了高达16倍的能效提升，为弹性无线系统建立了移动性驱动的信道适应新范式。

英文摘要

Reconfigurable intelligent surfaces (RIS) enable programmable control of wireless propagation but remain vulnerable to persistent deep fades in static deployments. This paper introduces a Movable Antenna-enhanced RIS (MA-RIS) architecture where antenna elements physically reposition to sample independent spatial channels, enabling mobility-induced diversity. We model antenna motion using a Stochastic Differential Equation (SDE) framework capturing controlled drift and environmental diffusion. Itô calculus-based analysis characterizes steady-state antenna distributions, spatial decorrelation, and outage probability, revealing fundamental trade-offs between control strength and mobility randomness. To maximize long-term SNR while accounting for control overhead, we propose an overhead-aware two-timescale framework separating slow antenna trajectory control from fast phase adaptation. The stochastic optimal control problem is solved via predictive approximation of the Hamilton-Jacobi-Bellman (HJB) formulation, enabling real-time implementation. Simulations validate theoretical predictions: the two-timescale strategy achieves up to 36 dB steady-state SNR with remarkable stability, outperforming position-only control by up to 15 dB and uncontrolled baselines by over 30 dB. Despite experiencing a lower SNR than Active RIS, the proposed approach delivers up to 16 times higher energy efficiency (EE) across varying system scales, establishing a new paradigm of mobility-enabled channel adaptation for resilient wireless systems.
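
An SDE with controlled drift and environmental diffusion can be simulated with the Euler-Maruyama scheme. The sketch below assumes a one-dimensional position with a linear restoring drift toward a commanded target (an Ornstein-Uhlenbeck-type model); this drift form, the parameter names, and the function name are illustrative assumptions, not the paper's model.

```python
import math
import random


def simulate_position(x0, target, k, sigma, dt, steps, seed=0):
    """Euler-Maruyama path of dX = -k (X - target) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        drift = -k * (x - target)  # control pulls the element toward the target
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # diffusion term
        x = x + drift * dt + noise
        path.append(x)
    return path
```

In this toy model a larger `k` tightens the steady-state spread around the target while a larger `sigma` widens it, which mirrors the control-strength versus randomness trade-off the abstract describes.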

2606.10164 2026-06-10 eess.SP 新提交

Curved Beam Enabled Wireless Communications: Modeling, Analysis and Optimization

弯曲波束赋能无线通信：建模、分析与优化

Jiawei Yao, Xiaoren Xu, Walid Saad, Mingzhe Chen

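AI summary: For scenarios with blockages, the paper uses a continuous aperture array to generate curved beams that improve wireless communication performance. It models beam control and a segmented channel, and designs an iterative algorithm based on fractional programming and enhanced block coordinate ascent to optimize the weighted sum-rate.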

AI中文摘要

本文研究了在存在障碍物的情况下，利用弯曲波束提升无线通信性能的问题。特别地，配备连续孔径阵列的发射机可以通过允许信号沿直线和弯曲路径传播来生成弯曲波束，以服务多个接收机。为了优化加权和速率，本文开发了一种弯曲波束模型，用于控制波束转向、波束聚焦和波束弯曲功能，并建立了一种分段信道模型来表征由障碍物引起的实际信道。基于所引入的弯曲波束模型，提出了一个优化问题，目标是在发射功率预算和弯曲波束物理约束下最大化所有用户的加权和速率。为了解决该问题，首先通过对连续坐标进行离散采样，将连续孔径转换为有限求和。然后，分析了理想连续孔径设计与其实际离散孔径近似之间的性能差距。基于上述离散近似，开发了一种迭代算法来优化弯曲波束控制参数。具体地，通过分数规划（FP）将原问题重新表述为可处理的形式。然后，通过设计一种增强的块坐标上升（BCA）方法来解决变换后的问题，该方法利用先前迭代的局部下降来确定代理构造点，从而加速收敛。接着，在代理函数中加入近端正则化项以控制更新幅度并抑制激进更新，从而提高更新稳定性。最后，基于有效信道增益计算波束幅度。仿真结果表明，与仅使用直线波束相比，所提方法可以改善加权和速率。

英文摘要

In this paper, the problem of using curved beams to improve wireless communication performance in the presence of a blockage is studied. In particular, a transmitter equipped with a continuous aperture array can generate curved beams to serve multiple receivers by allowing signals to propagate along both straight and curved paths. To optimize the weighted sum-rate, a curved beam model is developed for controlling the beam steering, beam focusing, and beam curving functions, along with a segmented channel model to characterize practical channels induced by the blockage. Based on the introduced curved beam model, an optimization problem is posed with the goal of maximizing the weighted sum-rate of all users under a transmit power budget and physical constraints of curved beams. To solve this problem, the continuous aperture is first converted into finite summations via a discrete sampling of the continuous coordinate. Then, the performance gap between the ideal continuous aperture design and its practical discrete aperture approximation is analyzed. Based on the above discrete approximation, an iterative algorithm is developed to optimize curved beam control parameters. In particular, the original problem is reformulated into a tractable form via fractional programming (FP). The transformed problem is then solved by designing an enhanced block coordinate ascent (BCA) method which determines a surrogate-construction point leveraging the local descent from previous iterations, thereby accelerating convergence. Next, a proximal regularization term is included in the surrogate function to control the update magnitude and suppress aggressive updates, thereby improving update stability. Finally, the beam amplitudes are computed based on the effective channel gains. Simulation results show that the proposed method can improve the weighted sum-rate compared to using only straight beams.

2606.10048 2026-06-10 eess.SP 新提交

Human Walking Sensing and Pose Estimation in the 6 GHz Band Using Amplitude and Phase CSI

使用幅度和相位CSI在6 GHz频段进行人体行走感知与姿态估计

Zhaorui Yin, Mattia Brambilla, Monica Nicoli

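AI summary: The paper studies indoor human pose estimation from the amplitude and phase CSI of 6 GHz OFDM signals, designing a processing pipeline and adapting four deep learning models. Experiments show that amplitude-only CSI performs comparably to joint amplitude-phase processing, and that phase information is more useful as a complementary feature.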

AI中文摘要

本文研究了在6 GHz频段运行的室内多基地无线网络中，利用正交频分复用（OFDM）信号进行人体姿态估计。我们设计并验证了一个处理流程，该流程利用来自多个无线电链路的信道状态信息（CSI）的幅度和相位来估计人体姿态。文献中的四种深度学习架构，即DT-Pose、MetaFi++、HPE-Li和VST-Pose，被适配到OFDM CSI结构，并扩展以联合利用幅度和相位信息。这些模型估计在网络覆盖区域内行走的人体姿态。使用标准姿态估计指标如Procrustes对齐平均每关节位置误差（PA-MPJPE）和骨骼长度损失（BLL）在开放访问数据集上进行性能评估。结果表明，从6 GHz OFDM CSI测量中可以实现可靠的人体姿态重建，其中DT-Pose提供了最佳的整体精度。平均而言，仅幅度CSI的性能与联合幅度-相位处理相当，而相位信息作为补充特征比作为独立输入更有益。

英文摘要

This paper investigates human pose estimation from Orthogonal Frequency-Division Multiplexing (OFDM) signals in an indoor multistatic wireless network operating in the 6 GHz band. We design and validate a processing pipeline that exploits both the amplitude and phase of the Channel State Information (CSI) from multiple radio links to estimate the human body pose. Four deep learning architectures from the literature, namely DT-Pose, MetaFi++, HPE-Li, and VST-Pose, are adapted to the OFDM CSI structure and extended to jointly exploit the amplitude and phase information. The models estimate the pose of a human walking within the network coverage area. Performance evaluation is conducted on an open-access dataset using standard pose-estimation metrics such as Procrustes-aligned Mean Per-Joint Position Error (PA-MPJPE) and Bone Length Loss (BLL). Results indicate that reliable human pose reconstruction can be achieved from 6 GHz OFDM CSI measurements, with DT-Pose providing the best overall accuracy. On average, amplitude-only CSI yields performance comparable to joint amplitude-phase processing, whereas phase information is more beneficial as a complementary feature rather than as a standalone input.
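
To make the pose metrics concrete, the sketch below computes a plain Mean Per-Joint Position Error and a simple bone-length discrepancy for 3D joints. It omits the Procrustes alignment step that PA-MPJPE applies before measuring the error, and the bone-length measure is one plausible reading of BLL rather than the paper's exact definition; the function names are hypothetical.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and true joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def bone_length_error(pred, gt, bones):
    """Mean absolute difference in bone length over (joint_i, joint_j) pairs."""
    errors = [
        abs(math.dist(pred[i], pred[j]) - math.dist(gt[i], gt[j]))
        for i, j in bones
    ]
    return sum(errors) / len(errors)
```

A rigidly shifted skeleton illustrates the difference between the two: its MPJPE equals the shift, while its bone-length error stays zero because the limb proportions are unchanged.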

2606.11125 2026-06-10 eess.SP cs.LG 新提交

DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals

DMT: 基于人口统计条件与形态增强Transformer的无袖带血压估计方法

Yidan Shen, Neville Mathew, Maham Rahimi, Deependra Dhakal, George Zouridakis, Xin Fu, Renjie Hu

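AI summary: The paper proposes a Transformer-based network for cuffless blood pressure estimation from PPG signals that incorporates demographic information through FiLM-style feature modulation and adds an auxiliary morphology head to steer the model toward waveform morphology related to arterial stiffness, achieving an MAE of 4.56 mmHg for systolic and 2.62 mmHg for diastolic blood pressure on the PulseDB dataset.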

AI中文摘要

血压（BP）是心血管风险评估和治疗决策的关键指标，而光电容积描记术（PPG）能够实现低成本、可穿戴友好的无袖带血压估计。然而，即使近期取得了进展，许多基于PPG的模型仅通过血压回归进行训练，可能依赖于以振幅为主的捷径。此外，系统性调节血管顺应性的人口统计协变量通常仅通过后期融合纳入，限制了特定于主体的表示学习。我们提出了一种基于Transformer的网络，用于从PPG信号进行无袖带血压估计，利用自注意力机制捕获多个心动周期之间的长程依赖关系。为了考虑特定主体的血管差异，模型通过Transformer块的注意力和前馈子层中应用的FiLM风格特征调制，以人口统计信息为条件。此外，我们添加了一个辅助形态头，引导模型关注与动脉硬度和波反射相关的血压相关波形形态。在大型PulseDB数据集上基于校准的评估协议下，所提方法在收缩压上实现了4.56 mmHg的平均绝对误差（MAE），在舒张压上实现了2.62 mmHg，与先前的人口统计增强PPG基线相比，误差分别减少了47%和50%。由此产生的轻量级单传感器模型支持在启用校准的部署场景中进行可扩展且临床可靠的无袖带血压估计。

英文摘要

Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signals, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves an MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50%, respectively, compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.
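
FiLM-style modulation scales and shifts each feature channel using parameters predicted from a conditioning vector. The sketch below shows the generic operation in pure Python, assuming linear maps from a small demographic vector to the per-channel scale and shift; the layer shapes and names are illustrative and do not reproduce the paper's architecture.

```python
def linear(weights, bias, vector):
    """Affine map: weights (rows) times vector, plus bias."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(cond) * features + beta(cond), channel by channel."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale
    beta = linear(w_beta, b_beta, cond)     # per-channel shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]
```

Because the scale and shift depend on the conditioning vector, two subjects with identical PPG features but different demographics produce different modulated activations, which is the motivation for conditioning inside the network rather than only at a late fusion stage.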

2606.10972 2026-06-10 eess.AS cs.AI 新提交

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

基于CNN和GRU网络的哮喘与COPD鉴别诊断中二维输入表示和子阶段融合策略的优化

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

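AI summary: The study optimizes 2D input representations (MFCC, log-mel spectrogram, VAR model) and sub-phase feature fusion strategies (direct concatenation, GRU, GRU with attention) for distinguishing asthma from COPD with CNN and GRU networks, reaching a best F1-score of 0.877.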

AI中文摘要

本研究旨在探索VAR模型与梅尔频率倒谱系数（MFCC）矩阵和对数梅尔谱图在深度学习中的性能比较。在肺音分类中，基于谱图的表示因呼吸周期时长不同而存在时间维度不一致的问题。除了传统的裁剪/零填充，还提出了自适应长度窗口来固定时间维度。通过测试一系列参数优化其频谱和时间维度。采用不同的卷积神经网络（CNN）架构从子阶段获得的二维表示中提取特征。然后使用各种策略融合提取的子阶段特征，包括直接拼接、门控循环单元（GRU）网络和带注意力的GRU。通过基于呼吸周期的评估和基于受试者的评估（包含多个呼吸周期）来评估模型性能。还研究了多种数据增强技术以应对数据规模限制。最佳基于周期的F1分数（0.877）通过使用13个系数和每子阶段表示64点时间分辨率的MFCC矩阵，随后进行直接特征拼接获得；最佳基于受试者的F1分数（0.855）通过使用13个系数和每完整周期表示256点时间分辨率的MFCC矩阵获得，两者均采用自适应长度窗口。增强总体上降低了模型性能，但mixup增强是测试方法中最好的。MFCC在区分哮喘和COPD方面优于对数梅尔谱图和VAR模型。复杂的融合策略并未改善诊断。增强没有贡献，表明真实数据在肺音研究中的重要性。

英文摘要

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.
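
Mixup, the augmentation the abstract singles out as the best of those tested, blends two training examples and their labels with a weight usually drawn from a Beta distribution. The sketch below is a generic illustration on flattened feature vectors with one-hot labels; the `alpha` value and function names are assumptions, not settings reported by the authors.

```python
import random


def sample_lambda(alpha, rng):
    """Draw the mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two examples and of their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

A typical call draws `lam = sample_lambda(0.4, random.Random(0))` and then mixes a pair of flattened MFCC matrices with their class labels.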

2606.10781 2026-06-10 eess.AS cs.CL new submission

Recovering the Zipfian Distribution in Unsupervised Term Discovery

Danel Slabbert, Simon Malan, Herman Kamper

AI summary: Centre-based clustering yields overly uniform type distributions in unsupervised term discovery. The paper proposes graph clustering instead, which substantially outperforms K-means and other centre-based methods on three languages and recovers lexicon distributions closer to Zipfian.

Abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
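
One simple way to check how Zipf-like a discovered lexicon is would be to fit a line to log cluster size against log rank. The sketch below assumes that cluster sizes are already available as a list; the function name and the toy data are hypothetical, and the paper may measure this differently.

```python
import math

def zipf_slope(cluster_sizes):
    """Least-squares slope of log(size) against log(rank).

    A slope near -1 indicates a Zipf-like distribution; a slope near 0
    indicates near-uniform cluster sizes.
    """
    sizes = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two non-empty clusters")
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

zipf_like = [1000.0 / r for r in range(1, 51)]  # size proportional to 1/rank
uniform = [20.0] * 50                           # every cluster the same size
```

On the toy data, `zipf_like` gives a slope close to -1 and `uniform` a slope close to 0, which is the contrast the abstract draws between graph clustering and K-means.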


2606.10738 2026-06-10 eess.AS cs.AI new submission

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

AI summary: Proposes Spatial-Omni, which uses an SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs, adding spatial audio understanding in a lightweight way and outperforming existing models on the newly built SO-Bench benchmark.

Abstract

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.
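
To make the FOA representation concrete, the sketch below encodes a mono sample arriving from a given direction into the four first-order components. It assumes SN3D normalisation and the traditional W, X, Y, Z naming; the function names are hypothetical and unrelated to the SO-Encoder itself.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample arriving from (azimuth, elevation) into
    first-order B-format components, assuming SN3D normalisation.

    Azimuth is measured counter-clockwise from the front, elevation upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                 # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)   # front-back
    y = sample * math.sin(az) * math.cos(el)   # left-right
    z = sample * math.sin(el)                  # up-down
    return w, x, y, z

def estimate_azimuth_deg(x, y):
    """Recover the azimuth of a single plane-wave source from X and Y."""
    return math.degrees(math.atan2(y, x))
```

A monaural signal corresponds to keeping only the W component, which is why direction cannot be recovered from it, whereas X and Y already determine the azimuth of a single source.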


2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG new submission

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

AI summary: Proposes ++nnU-Net, which augments data through image registration, generating warped images before preprocessing and training, and improves the Dice coefficient by up to about 22% across five 2D datasets.

Comments: 7 pages, 1 figure, 2 tables

Abstract

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
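
For reference, the Dice Similarity Coefficient used to report the gains is twice the overlap divided by the total foreground size of both masks. The sketch below works on flat binary lists; the function name and the handling of two empty masks are choices made here, not taken from the ++nnU-Net code.

```python
def dice(pred, target):
    """Dice Similarity Coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total
```

For example, a prediction with two foreground pixels of which one overlaps a one-pixel target scores 2 * 1 / (2 + 1), roughly 0.67.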


2606.10454 2026-06-10 eess.AS cs.SD new submission

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

Mohan Shi, Kaiyuan Zhang, Zilai Wang, Natarajan Balaji Shankar, Eray Eren, Abeer Alwan

AI summary: Proposes a Mixture-of-Experts Speech-LLM that uses classifier-based domain routing, Mixture-of-Projectors and Mixture-of-LoRAs modules, and an entropy-aware routing mechanism for unified child-adult ASR across diverse environments and age groups, achieving consistent improvements on public child corpora.

Comments: Accepted to Interspeech 2026

Abstract

While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
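
The abstract does not give the exact rule behind Entropy-Aware Routing, so the following is only one plausible reading: measure the entropy of the domain classifier's output and, when it is high, move part of the weight to a shared expert. The threshold, the blending rule, and all names are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K), clipped to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(1.0, h / math.log(len(probs)))

def route(domain_logits, threshold=0.5):
    """Return (domain expert weights, shared expert weight).

    Hypothetical rule: when the router is uncertain (entropy above the
    threshold), shift that share of the weight to a shared expert.
    """
    probs = softmax(domain_logits)
    h = normalized_entropy(probs)
    shared = h if h > threshold else 0.0
    return [p * (1.0 - shared) for p in probs], shared
```

With confident logits the shared expert stays unused, while near a domain boundary (nearly equal logits) most of the weight moves to it.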


2606.10317 2026-06-10 eess.AS cs.SD new submission

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu

AI summary: Proposes SSL-GMMVC, which models paired source-target features in self-supervised speech space with a Gaussian mixture model and performs interpretable voice conversion through posterior-weighted affine transforms, improving speaker similarity while preserving intelligibility and naturalness.

Comments: Accepted to Interspeech 2026

Abstract

We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
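
The phrase "posterior-weighted sum of affine transforms" can be sketched for a scalar feature as follows. In the classical joint-density GMM mapping, each slope and offset would come from the component's cross-covariance and means; here the parameters are made-up numbers, and the dictionary layout is an assumption rather than the paper's parametrisation.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def convert(x, components):
    """Posterior-weighted sum of affine transforms for a scalar feature.

    Each component is a dict with mixture weight `w`, source mean and
    variance `mean`/`var`, and an affine map y = a * x + b.
    """
    likelihoods = [c["w"] * gaussian_pdf(x, c["mean"], c["var"]) for c in components]
    total = sum(likelihoods)
    posteriors = [lik / total for lik in likelihoods]
    return sum(p * (c["a"] * x + c["b"]) for p, c in zip(posteriors, components))

# Two toy components with made-up parameters.
components = [
    {"w": 0.5, "mean": -5.0, "var": 1.0, "a": 2.0, "b": 0.0},
    {"w": 0.5, "mean": 5.0, "var": 1.0, "a": 1.0, "b": 1.0},
]
```

Near a component's mean the mapping is essentially that component's affine transform, and between components it interpolates smoothly, which is the locally linear behaviour the abstract describes.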


2606.10301 2026-06-10 eess.SP cs.SY eess.SY new submission

Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks

Xiangyu Li, Bodong Shang, Junchao Ma, Qingqing Wu, Jie Feng, Deshuang Huang

AI summary: Studies the downlink performance of LEO coordinated multi-satellite networks with non-orthogonal multiple access, using stochastic geometry to analyse coverage and spectral efficiency. Adding cooperative satellites does not always improve performance, while well-chosen power allocation brings significant gains.

Abstract

Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.
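
The role of the power allocation coefficients and of imperfect SIC can be illustrated with the textbook two-user downlink NOMA rate expressions. This is a single-transmitter sketch with fixed channel gains, not the paper's stochastic-geometry model; the parameter names and the residual-interference factor are assumptions.

```python
import math

def noma_rates(power, a_near, g_near, g_far, noise, sic_residual=0.0):
    """Achievable rates (bit/s/Hz) for a two-user downlink NOMA pair.

    a_near is the power fraction of the near (strong) user; the far user
    gets 1 - a_near. sic_residual in [0, 1] is the fraction of the far
    user's signal left after imperfect SIC at the near user.
    """
    a_far = 1.0 - a_near
    # Far user decodes its own signal, treating the near user's as noise.
    sinr_far = a_far * g_far * power / (a_near * g_far * power + noise)
    # Near user removes the far user's signal first, possibly imperfectly.
    sinr_near = a_near * g_near * power / (sic_residual * a_far * g_near * power + noise)
    return math.log2(1.0 + sinr_near), math.log2(1.0 + sinr_far)
```

Sweeping `a_near` and `sic_residual` in such a model shows the trade-off between the two users' rates and how residual interference erodes the near user's rate.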


2606.10280 2026-06-10 eess.IV cs.CV new submission

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

Fen Peng, Taizo Suzuki, Seisuke Kyochi

AI summary: Proposes OWDiff, an overlapped wavelet diffusion framework that removes blocking artifacts with an overlapped wavelet transform and restores detail with a low-frequency-guided high-frequency enhancement module, outperforming existing methods on the LOLv1 and LOLv2-real datasets.

DOI: 10.1587/transinf.2026PCP0006
Journal ref: IEICE Transactions on Information and Systems, Advance online publication, 2026
Comments: Advance published in IEICE Transactions on Information and Systems. DOI: 10.1587/transinf.2026PCP0006. Code: https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion

Abstract

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.
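
Why the Haar transform is prone to blocking artifacts can be seen from a single decomposition level: each group of four coefficients depends on one 2x2 block only. The sketch below uses the orthonormal scaling and generic sub-band names; it illustrates the plain Haar step, not the paper's overlapped transform, whose definition is not given in the abstract.

```python
def haar2x2(block):
    """One-level orthonormal 2-D Haar transform of a single 2x2 block.

    Returns (approx, horiz, vert, diag); only these four pixels are used,
    so neighbouring blocks never influence each other.
    """
    (a, b), (c, d) = block
    approx = (a + b + c + d) / 2.0
    horiz = (a - b + c - d) / 2.0   # difference between columns
    vert = (a + b - c - d) / 2.0    # difference between rows
    diag = (a - b - c + d) / 2.0
    return approx, horiz, vert, diag

def inverse_haar2x2(approx, horiz, vert, diag):
    """Exact inverse of haar2x2."""
    a = (approx + horiz + vert + diag) / 2.0
    b = (approx - horiz + vert - diag) / 2.0
    c = (approx + horiz - vert - diag) / 2.0
    d = (approx - horiz - vert + diag) / 2.0
    return [[a, b], [c, d]]
```

Because any error introduced in the coefficients of one block is reconstructed without reference to its neighbours, discontinuities can appear at block boundaries.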


2606.10233 2026-06-10 eess.AS cs.LG cs.SD new submission

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

Zhuoyan Tao, Jiatong Shi, Hye-jin Shim, Shinji Watanabe

AI summary: Proposes ANCHOR, which recasts incremental speech quality assessment as a multi-resolution autoregressive task. Dual-resolution tokens and a resolution-aware hierarchy refine chunk-level estimates into utterance-level ones, coarse to fine, substantially reducing error on partial input and showing how perceptual quality accumulates over time.

Comments: Accepted at Interspeech 2026

Abstract

While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.


2606.10231 2026-06-10 eess.AS cs.SD new submission

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Ruchao Fan, Yiming Wang, Yuxuan Hu, Bo Ren, Yufei Xia, Xiaofei Wang, Yao Qian, Jinyu Li

AI summary: Proposes Mel-LLM, an architecture with no dedicated speech encoder that feeds Mel spectrogram patches into the LLM through a linear projection. Experiments on ASR and TTS show it is feasible: ASR performance is comparable to encoder-based counterparts, and TTS is preliminarily viable.

Abstract

Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by the LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrograms directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.


2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD new submission

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

AI summary: Proposes DeRA-MOS, a decoupled optimization framework that uses a batch-aware listwise ranking loss for music impression and a score-anchored modality alignment loss for text alignment, significantly improving ranking metrics on MusicEval.

Comments: Accepted to IEEE Signal Processing Letters (SPL)

Abstract

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
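
Since the motivation rests on the gap between point-wise objectives and rank-based metrics, it may help to see SRCC spelled out: it is the Pearson correlation of the ranks. The sketch below is a plain reference implementation with average ranks for ties; the function names are chosen here and are not from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def srcc(pred, target):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) when either input is constant.
    """
    rp, rt = average_ranks(pred), average_ranks(target)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5
```

Predictions such as `[0.1, 0.2, 0.3]` against human scores `[1, 3, 5]` have a large point-wise error yet a perfect SRCC, which is the mismatch a listwise ranking loss targets.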


2606.09953 2026-06-10 eess.IV cs.AI cs.LG new submission

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

AI summary: Proposes a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing while denoising implicitly, and outperforms classical interpolation and video frame interpolation methods on structural measures.

Abstract

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.
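
As a point of comparison, the simplest classical baseline for halving the through-plane spacing is linear interpolation, that is, the pixel-wise mean of each neighbouring pair. Which classical baselines the authors actually used is not stated in the abstract, so the sketch below is an assumed example with hypothetical function names.

```python
def midpoint_slice(a, b):
    """Pixel-wise mean of two equally sized slices (lists of rows)."""
    return [[(p + q) / 2.0 for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def halve_spacing(volume):
    """Insert one interpolated slice between each neighbouring pair,
    turning n slices into 2n - 1."""
    if not volume:
        return []
    out = [volume[0]]
    for prev, nxt in zip(volume, volume[1:]):
        out.append(midpoint_slice(prev, nxt))
        out.append(nxt)
    return out
```

Averaging two slices with independent noise of variance sigma^2 yields noise variance sigma^2 / 2 in the result, which gives some intuition for why synthesized slices can look denoised, although the paper's own analysis may argue differently.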


2606.09893 2026-06-10 eess.IV cs.AI cs.LG new submission

Tractogram foundation model

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

AI summary: Proposes TractFM, a foundation model that learns reusable representations directly from whole-brain streamline sets. It combines a local streamline encoder with a permutation-equivariant tractogram encoder, is pretrained on dense anatomical tract parcellation, and transfers to streamline-level and subject-level tasks.

Abstract

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.
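
Permutation equivariance, the property required of the tractogram encoder, means that reordering the streamlines reorders the outputs in the same way. A minimal DeepSets-style layer on scalar features shows the idea; TractFM's actual architecture is not described at this level in the abstract, so the layer, its weights, and the names below are purely illustrative.

```python
def equivariant_layer(features, self_weight=1.0, context_weight=0.5):
    """DeepSets-style layer on scalar per-streamline features.

    Each output mixes the item's own feature with the set mean, so
    reordering the inputs reorders the outputs in the same way.
    """
    mean = sum(features) / len(features)
    return [self_weight * f + context_weight * mean for f in features]

def subject_descriptor(features):
    """Mean pooling gives an order-invariant summary of the whole set."""
    return sum(features) / len(features)
```

The per-item outputs play the role of contextualized streamline-level embeddings, while the pooled value stands in for a compact subject-level descriptor.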


2606.11167 2026-06-10 cs.CL eess.AS new submission

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

Affiliations: Kyutai; Gradium

AI summary: To address interactivity problems in full-duplex dialogue models, proposes an RL-based post-training alignment method that optimizes four axes (pause handling, turn-taking, backchanneling, and user interruption) and adds an LLM-based reward to prevent semantic degradation, yielding consistent improvements on Moshi and PersonaPlex.

Abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.


2606.11091 2026-06-10 eess.SY cs.SY q-bio.NC new submission

QUIET: Quantifying Underutilized Influential Edges for Targeted Synchronization

Sovesh Mohapatra, Christoffer G. Alexandersen, Panagiotis Fotiadis, Max B. Kelz, John A. Detre, Fabio Pasqualetti, Dani S. Bassett

AI summary: Proposes QUIET, an edge-centric framework that combines structural controllability with functional mutual information to identify energy-efficient synchronization pathways, and validates it on synthetic networks and human connectomes.

Comments: 38 Pages; 6 Figures; 8 SIs

Abstract

Network control theory can be used to model intrinsic and extrinsic strategies to steer neural dynamics. Standard approaches are node-centric, structural, and focused on achieving desired instantaneous states. Here, we develop an edge-centric approach that incorporates both structure and function to achieve extended patterns of neural dynamics characterized by desired synchronization states. Our method, Quantifying Underutilized Influential Edges for Targeted Synchronization (QUIET), is an edge-centric framework that integrates structural controllability of individual white matter connections and mutual information between pairwise functional time series to identify energy-efficient synchronization pathways. QUIET identifies quiet highways, edges that are structurally influential but functionally underutilized, to optimize regional synchronization. We validated QUIET across 75 synthetic configurations, where QUIET-ranked edge sets significantly outperformed random selection in 93% of cases (p<0.01). The framework, tested on Human Connectome Project participants, revealed that the control energy required for synchronization of the salience network correlates with fluid intelligence. QUIET, applied to healthy adults undergoing dexmedetomidine-induced unresponsiveness, showed that the frontoparietal and default-mode networks exhibited the largest control energy required for synchronization in both awake and sedated states. QUIET is released as stand-alone software for studying theoretically defined synchronization pathways, which in turn could inform testable hypotheses in perturbative studies.
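
The functional half of the framework relies on mutual information between pairs of regional time series. A minimal plug-in estimate for already discretised signals is sketched below; real functional time series are continuous and would need binning or another estimator first, and the estimator QUIET actually uses is not specified in the abstract.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (bits) between two equally long discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * math.log2(p_ab * n * n / (px[a] * py[b]))
    return mi
```

In this picture, an edge with high structural influence but low mutual information between its endpoints would be a candidate "quiet highway".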


2606.11050 2026-06-10 cs.MA cs.GT cs.SY eess.SY new submission

LLM-Mediated Demand Response Coordination in Smart Microgrids

J. de Curtò, I. de Zarzà

AI summary: For demand-response coordination among voluntarily cooperating prosumers in smart microgrids, proposes a hybrid decision architecture combining game theory with LLM narrative evaluation. Structured directives achieve a 33.3% demand-curtailment cooperation rate, outperforming unstructured messaging and the baseline.

Comments: Accepted for publication in 18th International Conference on Sustainability in Energy and Buildings (SEB-26), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

Abstract

Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ($\Delta_\mathrm{comp} = +0.063$), with the advantage preserved across both grounded and idealized agent substrates ($\Delta = +0.083$) and across all resistance levels ($R = 0.1$ to $0.7$). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.
