arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.06444 2026-06-05 eess.AS cs.CL cs.SD

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

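AI summary: Proposes USAD 2.0, a universal audio encoder that uses domain-aware distillation to combine knowledge from self-supervised and supervised foundation models, extends coverage to music, scales to one billion parameters via depth scaling, and achieves strong or state-of-the-art performance in probing and LLM-based evaluations.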

Comments
Accepted to Interspeech 2026

Abstract

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

2606.06418 2026-06-05 cs.LG cs.AI cs.SY eess.SY

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

Thomas T. Zhang, Alok Shah, Yifei Zhang, Vincent Zhang, Nikolai Matni, Max Simchowitz

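AI summary: Proposes double preconditioning (DoPr), an optimization paradigm that combines gradient-wise and activation-wise preconditioning to mitigate test-time feedback, the mismatch between training/validation loss and downstream metrics in settings such as autoregressive language modeling. It improves test-time performance without necessarily improving validation loss.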

Abstract

Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
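The combination described above can be sketched for a single linear layer. This is a minimal illustration, not the authors' algorithm: the abstract does not specify how the two preconditioners are composed, so the order (activation-wise first, then Adam), the damping value, and the names `activation_precondition` and `adam_step` are all assumptions.

```python
import numpy as np

def activation_precondition(grad_w, acts, damping=1e-3):
    """KFAC-style step: right-multiply the weight gradient of y = W @ a
    by the inverse of the damped activation second-moment matrix."""
    batch, d_in = acts.shape
    second_moment = acts.T @ acts / batch            # (d_in, d_in)
    damped = second_moment + damping * np.eye(d_in)  # damping is an assumed value
    # Solve G_pre @ damped = grad_w without forming an explicit inverse.
    return np.linalg.solve(damped, grad_w.T).T

def adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update (the gradient-wise preconditioner)."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 8))     # layer inputs for one batch
deltas = rng.standard_normal((64, 4))   # backpropagated output gradients
w = np.zeros((4, 8))
grad_w = deltas.T @ acts / 64           # plain weight gradient

state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
g_pre = activation_precondition(grad_w, acts)   # activation-wise step
w_new = adam_step(w, g_pre, state)              # gradient-wise step
```

The activation-wise step only needs the layer inputs already available in the forward pass, which is one way to read the abstract's "drop-in" claim.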

2606.06406 2026-06-05 eess.SY cs.SY

Expected String Stability of Human-Led Vehicle Platoons under Stochastic Communication Delays (Full Version)

Francisco Aguilera, Víctor Jaque, Andrés A. Peters, Alejandro I. Maass

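AI summary: Studies expected string stability of event-triggered vehicle platoons in which a human driver leads autonomous followers under stochastic communication delays. Derives stability conditions that depend on the full delay distribution through integral inequalities, and models the platoon as a stochastic hybrid system for validation.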

Comments
7 pages, 5 figures, submitted to CPHS 2026

Abstract

This paper studies expected $\mathcal{L}_2$ string stability of event-triggered vehicle platoons in which a human driver leads a chain of cooperatively controlled autonomous followers under stochastic communication delays. The leader's driving behavior propagates through the string via vehicle-to-vehicle (V2V) communication, so human-induced disturbances must not amplify along the platoon. Unlike deterministic approaches based on worst-case delay bounds, we derive string-stability conditions depending on the full delay distribution through integral inequalities. The closed-loop platoon is modeled as a stochastic hybrid system capturing vehicle dynamics, communication events, and event-triggering. This framework certifies string stability even when delays exceed deterministic admissible bounds with nonzero probability. Results are evaluated under several delay distributions using the MATLAB HyEQ simulator.

2606.06373 2026-06-05 eess.SP cs.AI

LatentWave: JEPA Pretraining for Wireless Foundation Models

Ahmed Mohamed, Ahmed Aboulfotouh, Hatem Abou-Zeid

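AI summary: Proposes LatentWave, which uses a Joint-Embedding Predictive Architecture (JEPA) to predict masked regions in latent space and learn transferable wireless signal representations. It is compared against a masked-modeling baseline on four downstream tasks.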

Abstract

Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.
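The two masking geometries contrasted above can be illustrated on a patch grid. This is a sketch under assumptions: the abstract does not define the masks precisely, so "frequency masking" is read here as hiding whole frequency rows and "region masking" as hiding one rectangular block. The function names and the (frequency, time) grid layout are hypothetical.

```python
import numpy as np

def frequency_mask(n_freq, n_time, n_bands, rng):
    """Hide `n_bands` complete frequency rows of an (n_freq, n_time) patch grid."""
    mask = np.zeros((n_freq, n_time), dtype=bool)
    rows = rng.choice(n_freq, size=n_bands, replace=False)
    mask[rows, :] = True
    return mask

def region_mask(n_freq, n_time, height, width, rng):
    """Hide one contiguous height x width rectangle of patches."""
    mask = np.zeros((n_freq, n_time), dtype=bool)
    top = rng.integers(0, n_freq - height + 1)
    left = rng.integers(0, n_time - width + 1)
    mask[top:top + height, left:left + width] = True
    return mask

rng = np.random.default_rng(0)
freq_mask = frequency_mask(8, 16, n_bands=3, rng=rng)
block_mask = region_mask(8, 16, height=4, width=5, rng=rng)
```

In a JEPA-style setup the predictor would be asked to produce latent embeddings for the `True` positions from the visible ones. The geometry decides which correlations (across frequency or within a local region) the model must exploit.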

2606.06358 2026-06-05 eess.SY cs.SY eess.SP

Impact of RTK Augmentation and INS Integration on GNSS Positioning Accuracy and Continuity: A Benchmarking Study on Inland Waterways

Yan-Yun Zhang, Jef Billet, Jan Swevers, Peter Slaets

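AI summary: Through a bridge-passage case study, static benchmarking, and closed-loop path-following experiments, evaluates how RTK augmentation and INS integration affect GNSS positioning on inland waterways under different receiver configurations. RTK substantially improves precision and uncertainty consistency, while INS provides short-term continuity during RTK outages but may introduce drift.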

Comments
8 pages, 6 figures. Accepted to The 10th IEEE Conference on Control Technology and Applications (CCTA) 2026

Abstract

RTK augmentation and INS integration are widely used to improve GNSS positioning performance. However, on inland waterways, bridges and surrounding structures can degrade satellite visibility and correction availability, causing RTK augmentation loss and GNSS/INS fusion transients. Since these effects depend on the local environment and sensor configuration, nominal receiver specifications are insufficient, and deployment-specific characterization is required. This paper presents a benchmarking study of an AsteRx-i3 D Pro+ GNSS/INS receiver installed within the mobile Sensor Box developed at KU Leuven. The study combines a real-world bridge-passage case study, static benchmarking, and closed-loop path-following experiments. The static benchmarking evaluates four receiver configurations: standalone GNSS, standalone GNSS with INS integration, RTK-augmented GNSS, and RTK-augmented GNSS with INS integration. The closed-loop experiments use INS-integrated GNSS as the navigation input and compare path-following operational performance with and without RTK augmentation. Results show that correction loss during bridge passage causes reduced positioning accuracy, increased positioning uncertainty, and recovery-induced state jumps exceeding 1 m. Static benchmarking and closed-loop experiments confirm that RTK augmentation substantially improves positioning precision and uncertainty consistency, while INS integration supports short-term continuity during RTK unavailability but may introduce drift, bias, or transient uncertainty variations. By characterizing the deployment-specific receiver behavior with RTK augmentation and INS integration, this study motivates higher-level state estimation as a necessary next step toward spatially continuous and uncertainty-consistent positioning on inland waterways. The experimental data are released at: https://doi.org/10.5281/zenodo.20541733.

2606.06357 2026-06-05 cs.SD cs.AI eess.AS

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

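AI summary: To address the weak structure of continuous audio autoencoder latents and the non-decodability of self-supervised encoders, proposes F3-Tokenizer, which combines a noise-regularized autoencoder bottleneck with a latent-side representation encoder to yield a single audio tokenizer for both understanding and generation.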

Comments
Technical report; early work; 9 pages, 2 figures, 5 tables

Abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

2606.06347 2026-06-05 eess.SY cs.LG cs.SY

Attack Detection using Time Series Foundation Models

Sribalaji C. Anand, Anh Tung Nguyen, George J. Pappas

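AI summary: For cyber-physical systems with no knowledge of the plant model, proposes a zero-shot attack detector built on the TimesFM time-series foundation model and validates it on the IEEE 14-bus power system.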

Comments
Under review

Abstract

This paper addresses the problem of attack detection in cyber-physical systems without any knowledge of the plant model or its structure. A remotely located plant transmits sensor measurements to an operator over a network that is assumed to be under attack. We consider two classes of attacks: model-free replay attacks and model-based stealthy attacks. For the latter, we derive closed-form expressions for the optimal stealthy attack policy against a $\chi^2$ detector, for both linear and nonlinear systems. We then propose a model-structure-free detector based on TimesFM, a time-series foundation model developed by Google Research, which serves as a surrogate residual generator operating in a zero-shot fashion. We show empirically that the TimesFM-based detector achieves comparable or superior attack-detection performance. The efficacy of the proposed approach is demonstrated numerically on the IEEE 14-bus power system. We also demonstrate that TimesFM predictions can serve as a substitute for corrupted measurements, a practical mitigation technique when classical redundancy assumptions fail.
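A residual-based chi-square test of the kind mentioned above can be sketched as follows. This is an illustrative assumption, not the paper's detector: the forecast array is a hand-written placeholder for whatever residual generator is used (a model-based observer or a foundation-model forecast), no TimesFM API is called, and the covariance and false-alarm level are made-up values.

```python
import numpy as np

def chi2_statistic(residuals, cov):
    """Per-time-step statistic g_t = r_t^T cov^{-1} r_t."""
    r = np.atleast_2d(np.asarray(residuals, dtype=float))
    whitened = np.linalg.solve(cov, r.T).T
    return np.einsum("ti,ti->t", r, whitened)

# Hypothetical two-sensor example: the last measurement has been tampered with.
measured = np.array([[1.00, 2.00], [1.10, 2.05], [6.00, 7.00]])
forecast = np.array([[1.00, 2.00], [1.00, 2.00], [1.00, 2.00]])  # placeholder forecaster output
cov = np.eye(2)                      # assumed residual covariance

# For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
# so the 99% quantile has the closed form -2 * ln(0.01).
threshold = -2.0 * np.log(0.01)

g = chi2_statistic(measured - forecast, cov)
alarms = g > threshold
```

A stealthy attacker in the sense of the abstract would shape the injected signal so that `g` stays below `threshold`, which is why the quality of the residual generator matters.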

2606.06239 2026-06-05 eess.SP

Foundation Models for Wireless Communications: From PHY Intelligence to Network Autonomy

Le Liang, Jiajia Guo, Jun Zhang, Chan-Byoung Chae, Lu Lu, Shugong Xu, Octavia A. Dobre, Shi Jin, Geoffrey Ye Li

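AI summary: Surveys foundation models in wireless communications across three paradigms (adapting pre-trained models, building wireless-native models, and agentic models), tracing the shift of physical-layer processing and resource management toward network autonomy.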

Abstract

6G networks will introduce unprecedented complexity, which calls for a paradigm shift in network optimization and management. Artificial intelligence (AI)-based solutions, especially those enabled by the recently developed foundation models, have been recognized as promising candidates. Foundation models are large-scale AI models with general-purpose feature extraction capabilities, and once trained on massive amounts of data, they can be adapted to solve a wide range of downstream tasks, either in a zero-shot manner or with few-shot fine-tuning. This article provides a comprehensive overview of how foundation models are reshaping physical-layer processing and wireless resource management across three progressive paradigms. First, we examine the adaptation of off-the-shelf pre-trained foundation models to various wireless tasks. Second, we explore wireless-native foundation models, built from scratch on wireless data to bridge cross-domain modality gaps and capture universal wireless-domain physical characteristics. Third, we highlight agentic foundation models, which elevate static data processing into autonomous, reasoning-driven network orchestration. Furthermore, we discuss the impact of applying foundation models to emerging 6G frontiers, including integrated sensing and communications (ISAC), new multiple-input multiple-output (MIMO) architectures, semantic communications, and system-level network autonomy. Finally, we identify critical open challenges and opportunities, charting a promising path toward fully intelligent and adaptive wireless networks.

2606.06211 2026-06-05 cs.CL cs.SD eess.AS

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

Fernando López, Santosh Kesiraju, Jordi Luque

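AI summary: Injects x-vector speaker information into every transformer layer of a frozen ASR encoder via Feature-wise Linear Modulation (FiLM) to adapt to individual pathological speakers without modifying base model weights. The approach is competitive with established adaptation strategies and retains performance on non-conditioned speech.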

Comments
Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop

Abstract

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
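The FiLM operation itself is a per-feature scale and shift predicted from a conditioning vector. The sketch below assumes linear projections from the speaker embedding and an identity initialization. Both are illustrative choices, not details taken from the paper, and `film` and its parameter names are hypothetical.

```python
import numpy as np

def film(hidden, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: gamma(s) * h + beta(s).

    hidden:      (T, d) frame-level activations of one encoder layer
    speaker_emb: (e,)   speaker embedding, e.g. an x-vector
    """
    gamma = speaker_emb @ w_gamma + b_gamma   # (d,) per-feature scale
    beta = speaker_emb @ w_beta + b_beta      # (d,) per-feature shift
    return gamma * hidden + beta              # broadcast over the T frames

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 4))
xvec = rng.standard_normal(3)

# Assumed identity initialization: gamma = 1, beta = 0, so the frozen
# encoder's behaviour is unchanged before any adaptation.
identity_out = film(hidden, xvec,
                    w_gamma=np.zeros((3, 4)), b_gamma=np.ones(4),
                    w_beta=np.zeros((3, 4)), b_beta=np.zeros(4))

# A non-trivial shift: every feature is offset by the sum of the embedding.
shifted_out = film(hidden, xvec,
                   w_gamma=np.zeros((3, 4)), b_gamma=np.ones(4),
                   w_beta=np.ones((3, 4)), b_beta=np.zeros(4))
```

Only the small projection matrices would be trained in such a setup, which is consistent with the abstract's statement that base model weights are left untouched.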

2606.06200 2026-06-05 cs.SD eess.AS

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

Jinyi Mi, Ding Ma, Tomoki Toda

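AI summary: To address cross-language distribution mismatch and the lack of emotion annotations in the target language in zero-shot cross-lingual speech emotion recognition, proposes an emotion-discriminative representation learning method that combines supervised contrastive learning with speaker adversarial learning and significantly improves recognition performance.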

Comments
Accepted to Interspeech 2026

Abstract

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
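The supervised contrastive term can be sketched with the commonly used SupCon formulation, in which every other sample with the same emotion label is a positive for a given anchor. Whether the paper uses exactly this form, and the temperature value, are assumptions. The speaker adversarial branch is not shown.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors with at least one positive."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                 # an anchor is never its own candidate
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    labels = np.asarray(labels)
    positives = labels[:, None] == labels[None, :]
    np.fill_diagonal(positives, False)
    has_pos = positives.any(axis=1)

    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -summed[has_pos] / positives.sum(axis=1)[has_pos]
    return per_anchor.mean()

# Toy embeddings: two tight groups in 2-D.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = supcon_loss(emb, [0, 0, 1, 1])      # labels match the geometry
loss_misaligned = supcon_loss(emb, [0, 1, 0, 1])   # labels cut across the groups
```

Because positives are defined by emotion label alone, utterances from different source languages with the same emotion are pulled together, which is the alignment effect the abstract describes.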

2606.06183 2026-06-05 eess.AS cs.CL

Revisiting Lexicon Evaluation in Unsupervised Word Discovery

Simon Malan, Danel Slabbert, Herman Kamper

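AI summary: Shows that normalized edit distance, a common metric in unsupervised word discovery, is biased toward the quality of large clusters and ignores how true classes are distributed across clusters. Proposes two metrics, a modified within-cluster consistency metric and an inverse distribution metric, and shows experimentally that together they correlate more closely with the ground-truth distribution and are more robust.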

Comments
6 figures

Abstract

Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
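One way a cluster-size bias can arise is sketched below, assuming normalized edit distance (NED) is pooled over all within-cluster pairs. The number of pairs grows quadratically with cluster size, so a large cluster dominates the average. The pooled and per-cluster averages here are illustrations only. They are not the paper's proposed metrics, and the exact pooling used in prior evaluations is an assumption.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ned(a, b):
    """Edit distance normalized by the longer sequence length."""
    return edit_distance(a, b) / max(len(a), len(b))

def pair_neds(cluster):
    return [ned(a, b) for a, b in combinations(cluster, 2)]

def pooled_ned(clusters):
    """Average over every within-cluster pair (large clusters dominate)."""
    distances = [d for c in clusters for d in pair_neds(c)]
    return sum(distances) / len(distances)

def per_cluster_ned(clusters):
    """Average of per-cluster means (each cluster counts once)."""
    means = [sum(p) / len(p) for p in map(pair_neds, clusters) if p]
    return sum(means) / len(means)

cat, dog = ("k", "ae", "t"), ("d", "ao", "g")
clusters = [[cat] * 5, [cat, dog]]   # one large pure cluster, one small mixed cluster

pooled = pooled_ned(clusters)        # 10 perfect pairs outweigh the 1 bad pair
macro = per_cluster_ned(clusters)
```

The pooled score makes this lexicon look nearly perfect even though half of its clusters are wrong, which is the kind of distortion the abstract warns about.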

2606.06173 2026-06-05 eess.SY cs.SY stat.AP

From data to decisions: Bayesian modelling and global sensitivity analysis for flotation control

Paulina Quintanilla, Agustin Fuenzalida, Daniel Navia, Pablo Brito-Parada

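AI summary: Presents a data-driven framework that integrates Gaussian process regression with Sobol-index global sensitivity analysis and SHAP local interpretability for interpretable modelling and decision support in flotation systems. A static GP surrogate built from laboratory data identifies the key variables affecting air recovery and their interactions.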

Abstract

This work presents a data-driven framework for interpretable modelling and decision support in flotation systems, integrating Gaussian Process (GP) regression with Global Sensitivity Analysis (GSA) via Sobol indices and local interpretability using SHapley Additive exPlanations (SHAP). Based on laboratory-scale experimental data, a static GP surrogate model is developed to capture how superficial air velocity, overflowing froth velocity, froth height over the lip, pulp height, bubble size, and tailings flowrate influence the measured air recovery. The trained GP enables the computation of Sobol indices to quantify the contribution of each variable and their interactions to the overall variance in air recovery. The combination of Bayesian inference and Sobol-based sensitivity metrics provides a systematic approach to identify the dominant and interacting variables governing air recovery. This study links Bayesian learning, sensitivity quantification, and explainability to provide a foundation for data-driven control and optimisation of flotation processes.
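First-order Sobol indices can be estimated from any cheap surrogate by Monte Carlo sampling. The sketch below uses a pick-freeze (Saltelli-style) estimator on a toy additive function that stands in for the trained GP. The function, the uniform input ranges, and the sample size are assumptions for illustration and have nothing to do with the flotation data.

```python
import numpy as np

def first_order_sobol(model, dim, n_samples, rng):
    """Estimate first-order Sobol indices for inputs uniform on [0, 1]^dim."""
    a = rng.random((n_samples, dim))
    b = rng.random((n_samples, dim))
    f_a, f_b = model(a), model(b)
    mean = f_a.mean()
    variance = np.concatenate([f_a, f_b]).var()
    indices = np.empty(dim)
    for i in range(dim):
        a_b = a.copy()
        a_b[:, i] = b[:, i]            # take only input i from the second sample
        indices[i] = np.mean((f_b - mean) * (model(a_b) - f_a)) / variance
    return indices

def toy_model(x):
    """Stand-in surrogate: y = x1 + 2 * x2, with analytic indices 0.2 and 0.8."""
    return x[:, 0] + 2.0 * x[:, 1]

sobol = first_order_sobol(toy_model, dim=2, n_samples=200_000,
                          rng=np.random.default_rng(0))
```

For this additive toy model the indices sum to one. With interacting inputs the shortfall from one would indicate how much variance is due to interactions, which is the quantity the abstract refers to.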

2606.06170 2026-06-05 eess.AS

CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer's Disease Detection

Yin-Long Liu, Yuanchao Li, Yiming Wang, Yue Li, Rui Feng, Jiaxin Chen, Shaobo Liu, Liu He, Yuang Chen, Jiahong Yuan, Zhen-Hua Ling

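AI summary: Proposes CoSTA, which synthesizes speech with cognitive-state-conditioned TTS models and uses ASR transcripts for data augmentation, reaching 85.83% audio-only detection accuracy on the ADReSS test set.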

Comments
Accepted by Interspeech 2026

Abstract

Speech-based Alzheimer's Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.

2606.06167 2026-06-05 eess.SY cs.SY

Voltage Unbalance-Aware AC Optimal Power Flow in Distribution Networks

Alireza Zabihi, Luis Badesa, Araceli Hernandez

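AI summary: As single-phase loads and distributed generation worsen voltage unbalance in distribution grids, proposes an Improved Hybrid Limits (IHL) formulation that preserves voltage unbalance limits in a three-phase AC optimal power flow model while using a smooth unbalance proxy in the objective, giving grid-code-compliant mitigation with faster and more reliable convergence.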

Abstract

The increasing penetration of single-phase loads and distributed generation exacerbates voltage unbalance (VU) in distribution grids, raising concerns about power quality and complicating network operation. However, most market-clearing models and price-based coordination frameworks do not enforce VU limits within a three-phase AC representation, so the implications for grid-code compliance, numerical scalability, and economic signals remain unclear. This paper embeds VU in a three-phase AC optimal power flow market-clearing model and benchmarks two treatments: strict VU limit enforcement and objective function penalization. Building on these insights, an Improved Hybrid Limits (IHL) formulation is proposed that preserves compliance while using a smooth unbalance proxy in the objective to guide the optimization solver. Case studies on a European low-voltage feeder show that IHL maintains feasible operating points, yields price and curtailment signals consistent with conventional hybrid formulations, and converges substantially faster and more reliably than a penalization based on the exact unbalance metric. These results support IHL as a practical and scalable mechanism for VU mitigation in market-based operation of unbalanced distribution systems.
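A common way to quantify voltage unbalance is the voltage unbalance factor (VUF), the ratio of negative- to positive-sequence voltage magnitude obtained from symmetrical components. The abstract does not say which metric the paper treats as exact, so using the VUF here is an assumption, and the phasor values are made up.

```python
import cmath

A = cmath.exp(2j * cmath.pi / 3)   # 120-degree rotation operator

def sequence_components(va, vb, vc):
    """Fortescue transform: zero-, positive-, and negative-sequence phasors."""
    v0 = (va + vb + vc) / 3
    v1 = (va + A * vb + A**2 * vc) / 3
    v2 = (va + A**2 * vb + A * vc) / 3
    return v0, v1, v2

def vuf_percent(va, vb, vc):
    """Voltage unbalance factor |V2| / |V1| in percent."""
    _, v1, v2 = sequence_components(va, vb, vc)
    return 100 * abs(v2) / abs(v1)

# Balanced positive-sequence set (per unit): b lags a by 120 degrees.
balanced = vuf_percent(1.0, A**2, A)

# Same angles, but phase a sags to 0.95 p.u. (hypothetical single-phase loading).
sagged = vuf_percent(0.95, A**2, A)
```

Because the VUF is a ratio of phasor magnitudes, it is nonsmooth and nonconvex in the voltage variables, which gives some intuition for why a smooth proxy in the objective might help a solver converge.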

2605.03753 2026-06-05 math.OC cs.NE cs.SY eess.SY

Exact and Evolutionary Algorithms for Sequential Multi-Objective Transmission Topology Planning

Job Groeneveld, Miguel Muñoz, Jan Viebahn, Alessandro Zocca

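AI summary: For day-ahead topology control of high-voltage grids, proposes an exact block algorithm and an NSGA-III-based evolutionary heuristic that optimize line loading, topological depth, number of topology changes, and time outside the reference topology under $N-1$ security constraints. For a highly congested day, the exact algorithm obtains the complete Pareto front in under three minutes after load-flow preprocessing.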

Comments
27 pages, 6 figures

Abstract

We study day-ahead transmission topology control for high-voltage grid operation under $N-1$ security constraints. The operational task is to select, over a 24-hour horizon, a sequence of substation topologies obtained via busbar-coupler switching to relieve line overloads while limiting switching effort and topological complexity. We formulate this task as a sequential multi-objective optimization problem with four objectives used in TSO decision making: worst-case $N-1$ line loading, maximum topological depth, number of topology changes, and time spent outside the reference topology. We propose an exact block algorithm that exploits the temporal structure of topology plans: consecutive hours with the same topology are represented as blocks, enabling enumeration of the complete Pareto front over the admissible set of topologies under fixed operational bounds on depth and switching. We also develop a tailored NSGA-III-based evolutionary heuristic and evaluate it against the exact front. Using real operational data from the Dutch high-voltage transmission grid operated by TenneT, the block algorithm computes the exact front for a highly congested day in under three minutes after topology-level load-flow preprocessing. The exact front reveals low-switching plans with no DC $N-1$ thermal overloads that the tested evolutionary search fails to find. The proposed method, therefore, provides both a practical day-ahead decision-support tool for transmission operators and a benchmark for heuristic and learning-based topology-control methods.
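The block representation and two of the four objectives can be sketched as follows. This is a simplified illustration, not the paper's algorithm: line loading and topological depth need load-flow results and are omitted, and the topology labels, the helper names, and the candidate plans are hypothetical.

```python
from itertools import groupby

def to_blocks(plan):
    """Run-length encode an hourly plan into (topology, start_hour, length) blocks."""
    blocks, hour = [], 0
    for topology, run in groupby(plan):
        length = len(list(run))
        blocks.append((topology, hour, length))
        hour += length
    return blocks

def objectives(plan, reference):
    """Return (number of topology changes, hours spent outside the reference)."""
    blocks = to_blocks(plan)
    changes = len(blocks) - 1
    hours_off_reference = sum(n for t, _, n in blocks if t != reference)
    return changes, hours_off_reference

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical 24-hour plan: leave the reference topology for 10 hours.
plan = ["ref"] * 8 + ["T1"] * 10 + ["ref"] * 6
blocks = to_blocks(plan)
scores = objectives(plan, reference="ref")

# Hypothetical objective vectors of four candidate plans.
front = pareto_front([(2, 10), (4, 6), (4, 12), (3, 10)])
```

Working with blocks rather than individual hours keeps the search space small when switching is limited, since a plan with few topology changes has only a few blocks.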

2606.06110 2026-06-05 eess.SP

Subarray based Wideband Beamforming and Variational Sparse CSI Estimation for Low-Resolution MU THz MIMO Systems

Abhisha Garg, Suraj Srivastava, Akash Kumar, Aditya K. Jagannatham

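AI summary: To handle hardware constraints and frequency-dependent propagation in THz MIMO systems, proposes a unified channel estimation and beamforming framework based on variational Bayesian inference, using Bussgang decomposition to handle low-resolution ADC nonlinearity and a true-time-delay hybrid transceiver to compensate for beam squint.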

Journal ref
IEEE Transactions on Vehicular Technology 2026

Abstract

This work conceives a unified channel estimation and beamforming framework, formulated within the principles of variational Bayesian inference. Recognizing the limitations imposed by hardware constraints, frequency-dependent propagation effects, and the structural restrictions of partially connected architectures in the Terahertz (THz) band, we formulate a dual-wideband channel model incorporating a root-raised-cosine (RRC) pulse shape to account for its band-limited nature. To further address the nonlinear distortions introduced by low-resolution ADCs, Bussgang decomposition is employed, enabling a tractable linearized inference process. Unlike conventional techniques, the proposed method accommodates both on-grid and off-grid angular domains, capturing spatial sparsity with improved resolution and robustness. The multi-user (MU) Bayesian Cramér-Rao lower bound is also derived to benchmark the performance of the proposed estimator. Moreover, the framework incorporates a true time delay (TTD)-based hybrid transceiver design that inherently compensates for the beam-squint effect, a frequency-dependent angular deviation that arises due to the fixed-phase nature of the conventional beamformer in wideband systems, thereby ensuring accurate directional alignment across all subcarriers. Extensive simulation results validate the effectiveness of the proposed variational Bayesian inference-based estimator and the TTD-enabled beamforming architecture, highlighting their robustness and performance gains in practical wideband THz systems.
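The Bussgang decomposition rewrites a nonlinear quantizer output as a linear gain on the input plus a distortion term that is uncorrelated with the input. The sketch below checks this numerically for the simplest case, a real-valued 1-bit quantizer with unit-variance Gaussian input. The paper's setting (complex multi-antenna signals and general low-resolution ADCs) is more involved, so this scalar example is only an assumed stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)   # Gaussian input with unit variance
y = np.sign(x)                     # 1-bit quantizer (illustrative ADC model)

# Bussgang gain B = E[x y] / E[x^2]; for this quantizer it equals sqrt(2 / pi).
gain = np.mean(x * y) / np.mean(x * x)

# Linearized model: y = gain * x + distortion, with distortion uncorrelated with x.
distortion = y - gain * x
correlation = np.mean(x * distortion)
```

Treating `distortion` as additional noise is what turns the quantized observation model into a linear one on which standard Bayesian inference can operate.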

2606.06040 2026-06-05 cs.RO cs.SY eess.SY

Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots

Antonio Alvarez Valdivia, Robert Reeve, Ankush Dhawan, Ciera McFarland, Chad Council, Margaret McGuinness, Nathaniel Hanson

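AI summary: Proposes a triangular roller tip mount that reduces growth resistance by rolling rather than sliding, enabling consistent eversion of a TPU-coated ripstop nylon vine robot, and establishes a repeatable benchmarking framework.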

Comments
Accepted to IEEE Robotics & Automation Letters

Abstract

Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.

2606.05994 2026-06-05 cs.LG eess.SP

HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care

HoT-SSM:用于医疗保健的高阶时序知识图谱推理与状态空间模型

Thummaluru Siddartha Reddy, Vempalli Naga Sai Saketh, Yash Punjabi, Mahesh Chandran

AI总结 提出HoT-SSM模型,通过构建超图捕获高阶临床交互,并利用动态超图状态空间模型建模长程时序依赖,在MIMIC-III/IV数据集上显著提升临床预测性能。

详情
Comments
Paper under review
AI中文摘要

融合临床知识的医学知识图谱(MKGs)越来越多地被用于建模电子健康记录(EHRs),以支持医疗领域的可解释预测。然而,现有的基于MKG的方法局限于捕获临床概念(如病情、手术和药物)之间的成对关系,这限制了其建模共现或语义相关概念间高阶交互的能力。此外,大多数利用MKG的表示学习方法要么跨就诊折叠时间信息,要么缺乏显式建模长程时序依赖的机制,而这对于死亡率预测等临床任务至关重要。为缓解这些局限,我们提出HoT-SSM,一种参数高效的高阶时序图推理方法,结合状态空间模型。对于每次就诊,HoT-SSM通过利用领域知识将语义相关的临床概念分组为超边来构建超图,从而保留就诊级别的临床上下文。此外,为在学习表示的同时建模时序动态,我们引入一种新颖的基于动态超图的状态空间模型,显式捕获患者潜在状态随时间演变,同时保留长程信息。学到的表示用于下游临床预测和推理。在MIMIC-III和MIMIC-IV数据集上的实验表明,性能显著优于当前最先进模型,证明了联合建模高阶临床交互和长程时序依赖的有效性。

英文摘要

Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in the healthcare domain. However, existing MKG-based approaches are limited to capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), which restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which are critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter-efficient, higher-order temporal graph reasoning approach with state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients' latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on the MIMIC-III and MIMIC-IV datasets show significant performance improvement over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.

2606.05993 2026-06-05 cs.IT cs.SY eess.SP eess.SY math.IT

Double-Directional Wireless Channel Modeling Using Statistics-Aided Machine Learning

基于统计辅助机器学习的双向无线信道建模

Richmond Boamah, Ferdous Pervej

AI总结 提出一种统计辅助的机器学习方法,通过选择前M个多径分量并构建可学习图来训练混合TimesNet-TimeFilter模型,以生成未来双向信道实现,其统计特性与完整时变信道匹配。

详情
AI中文摘要

双向(DD)无线信道模型对于现实系统设计非常重要,因为它提供了完整的传播信息。虽然随机和确定性信道模型被广泛采用,且现有的机器学习(ML)解决方案大多旨在对齐未来的信道实现,但这些解决方案通常局限于可能不具有统计显著性的短时间跨度。此外,由于多径分量(MPC)的数量随接收器(RX)和/或交互物体(IO)的空间和时间变化而变化,需要固定、预定义输入和输出形状的典型ML解决方案无法胜任。为了克服这些限制,我们提出了一种统计辅助的ML解决方案,该方案依赖于固定子集的MPC选择。更具体地说,我们首先选择前$M$个MPC,其中$M\in\mathbb{Z}^+$远小于MPC总数,并构建可学习图来训练我们提出的混合TimesNet-TimeFilter(TNTF)模型。然后,我们使用信道统计辅助的训练方法来生成未来的前M个DD信道实现,使得从这些实现计算出的统计量与来自完整时变DD信道实现的实际统计量紧密匹配。我们通过在合成随机信道模型(SCM)和基于确定性射线追踪的数据集上进行大量仿真来验证所提出的解决方案,并展示了其相对于最先进基线的有效性。

英文摘要

The double-directional (DD) wireless channel model is important for realistic system design since it provides complete propagation information. While stochastic and deterministic channel models are widely adopted, and existing machine learning (ML) solutions mostly aim to align future channel realizations, these solutions are often limited to short time spans that may not be statistically significant. Moreover, because the number of multi-path components (MPCs) varies with spatial and temporal variation of the receiver (RX) and/or interacting objects (IOs), typical ML solutions that require fixed, predefined input and output shapes fall short. To overcome these limitations, we propose a statistics-aided ML solution that relies on selecting a fixed subset of MPCs. More specifically, we first select the top-$M$ MPCs, where $M\in\mathbb{Z}^+$ is much smaller than the total number of MPCs, and construct learnable graphs to train our proposed hybrid TimesNet-TimeFilter (TNTF) model. We then use a channel statistics-aided training method to generate future top-$M$ DD channel realizations such that the statistics calculated from these realizations closely match the actual statistics from the complete time-varying DD channel realizations. We validate the proposed solution using extensive simulations on both synthetic stochastic channel model (SCM)-based and deterministic ray-tracing-based datasets, and demonstrate its effectiveness relative to state-of-the-art baselines.
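
The fixed-subset idea can be sketched as follows: a variable-length list of MPCs is reduced to exactly M entries so that a model with fixed input shape can consume it. This is a minimal sketch only. Ranking by power, the padding value, and the `MPC` field names are assumptions made for illustration; the abstract does not state the selection criterion.

```python
from typing import List, NamedTuple


class MPC(NamedTuple):
    """Hypothetical multi-path component record (field names are assumptions)."""
    power_db: float
    delay_ns: float
    aod_deg: float  # angle of departure
    aoa_deg: float  # angle of arrival


def select_top_m(mpcs: List[MPC], m: int, pad_power_db: float = -200.0) -> List[MPC]:
    """Return exactly m MPCs: the strongest ones, padded if fewer than m exist."""
    ranked = sorted(mpcs, key=lambda c: c.power_db, reverse=True)[:m]
    padding = [MPC(pad_power_db, 0.0, 0.0, 0.0)] * (m - len(ranked))
    return ranked + padding


snapshot = [MPC(-80.0, 35.0, 10.0, -20.0), MPC(-65.0, 12.0, 5.0, 40.0), MPC(-95.0, 60.0, -30.0, 15.0)]
print(select_top_m(snapshot, m=2))  # the two strongest components
print(len(select_top_m(snapshot, m=5)))  # padded up to a fixed length
```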

2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

多模态还是非多模态:通过主动模态检测的查询自适应音视频人物检索

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

AI总结 提出一种查询自适应框架,通过跨模态分数一致性检测主动模态,在BBC Rewind语料库上达到94.2%的P@1,优于单模态和固定融合方法。

详情
Comments
INTERSPEECH 2026
AI中文摘要

当通过语音和面部从视频档案中检索一个人时,系统应该是多模态的吗?在实际的广播档案中,与精心策划的基准不同,目标可能只被听到但未被看到、只被看到但未被听到,或者两者兼有。融合来自缺失模态的分数会引入噪声,使精度低于最佳单模态系统。我们提出了一种查询自适应框架,通过跨模态分数一致性检测主动模态:当两种模态都活跃时,由一种模态检索的文件在另一种模态上也得分高;当一种模态缺失时,这种一致性被破坏。由这些跨模态特征驱动的分类器实现了89%的检测准确率。在BBC Rewind语料库(包含超过12,000个广播视频)上,自适应系统达到了94.2%的P@1,优于仅语音(82.9%)、仅面部(93.4%)和固定融合(90.0%),恢复了与具有真实模态标签的Oracle(96.6%)之间差距的64%。

英文摘要

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
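
The cross-modal score consistency cue can be sketched in a few lines: take the files ranked highest by one modality and check how well they score on the other. This is a simplified illustration. The paper trains classifiers on such features, whereas the fixed threshold, the value of `k`, and all function names here are assumptions.

```python
from typing import Dict


def cross_modal_consistency(primary: Dict[str, float], secondary: Dict[str, float], k: int = 3) -> float:
    """Mean secondary-modality score of the top-k files ranked by the primary modality."""
    top = sorted(primary, key=primary.get, reverse=True)[:k]
    return sum(secondary[f] for f in top) / len(top)


def both_modalities_active(voice: Dict[str, float], face: Dict[str, float],
                           k: int = 3, threshold: float = 0.5) -> bool:
    """Threshold stand-in for a learned classifier: require agreement in both directions."""
    face_given_voice = cross_modal_consistency(voice, face, k)
    voice_given_face = cross_modal_consistency(face, voice, k)
    return min(face_given_voice, voice_given_face) >= threshold


voice = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}
face_present = {"a": 0.85, "b": 0.9, "c": 0.2, "d": 0.1}
face_absent = {"a": 0.1, "b": 0.15, "c": 0.2, "d": 0.12}
print(both_modalities_active(voice, face_present, k=2))
print(both_modalities_active(voice, face_absent, k=2))
```

When the check fails, a query-adaptive system would avoid fusing the scores; deciding which single modality to trust needs further features that this sketch does not model.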

2606.05911 2026-06-05 cs.SD cs.LG eess.AS

DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement

DBHN-Net: 低复杂度单声道语音增强的双分支混合神经网络

Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou, Zhao Lv, Xuelong Li

AI总结 提出一种结合ANN和SNN的双分支混合神经网络,通过BandSplit、TF-Mamba等模块降低计算复杂度,同时利用交互和融合模块保持性能,在三个公共数据集上实现平均7.5倍复杂度降低。

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2026)
Comments
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
AI中文摘要

尽管基于人工神经网络(ANN)的语音增强(SE)方法表现出色,但高计算复杂度和高能耗阻碍了它们在实际前端处理任务中的部署。目前,脉冲神经网络(SNN)在降低功耗方面显示出潜力。然而,SNN的离散二进制激活和复杂的时空动态常常导致信息丢失。因此,当前的挑战集中在如何保持性能并降低计算复杂度。为了解决这个问题,本文提出了一种双分支混合神经网络(DBHN)。1)在网络架构方面:设计了一个集成ANN和SNN的双分支网络,其中SNN分支降低功耗,而ANN分支解决信息丢失;开发了BandSplit和时频(TF)-Mamba模块,以同时压缩能耗和增强模型性能;实现了带有残差连接的脉冲特征提取组(SFEG)和信息转换块(ITB)组件,以减轻信息丢失,同时进一步细化特征表示。2)为了促进分支间的信息融合:设计了一个交互模块,以促进双分支网络各个阶段的信息交换;设计了一个TF交叉注意力融合模块,在数据自适应地引导SNN分支保留更多关键信息的同时,对双分支信息进行时频域融合。结果表明,所提出的模型在三个公共数据集上保持了优越的性能,同时与基线模型相比,计算复杂度平均降低了7.5倍。

英文摘要

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and high energy consumption hinder their deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have recently shown potential in reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating ANN and SNN was designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; the BandSplit and Time-Frequency (TF)-Mamba modules were developed to simultaneously reduce energy consumption and enhance model performance; Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components were implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module was designed to promote information exchange at various stages of the dual-branch network; a TF-Cross Attention-Fusion module was designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.

2606.05909 2026-06-05 cs.SD eess.AS

Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes

超越WER:面向环境临床记录员的配对声学压力测试

Xiao-Hang Jiang, Han-Jie Guo, Ying-Si Liang, Yang Ai, Zhen-Hua Ling, Lei Jiang, Zhi-Yang He

AI总结 提出配对声学压力测试方法,通过注入噪声并冻结下游模型,揭示噪声对临床推理的安全影响,发现轻微声学扰动可逆转临床意义而不显著增加词错误率,并展示轻量级缓解策略。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

环境临床记录员越来越多地将自动语音识别与大型语言模型结合以自动化文档记录。然而,词错误率等传统指标掩盖了系统性的安全性退化。我们提出了一种配对声学压力测试,以隔离噪声对临床推理的因果影响。对于相同的对话,我们在保持下游模型配置不变的情况下注入多种噪声类型。关键的是,我们发现信号保真度与临床安全性之间存在危险的脱节。平稳环境噪声使词错误率仅增加了微不足道的0.71个百分点,但几乎使不安全输出的比例翻倍。我们的分析表明,轻微的声学扰动可以在不显著增加错误率的情况下逆转临床含义。此外,我们展示了一种轻量级缓解策略,该策略在噪声条件下减轻安全性退化,而无需进行模型微调。

英文摘要

Ambient clinical scribes increasingly combine Automatic Speech Recognition with Large Language Models to automate documentation. However, traditional metrics like Word Error Rate mask systemic safety degradation. We present a paired acoustic stress test to isolate the causal impact of noise on clinical reasoning. For the same dialogues, we inject diverse noise types while keeping the downstream model configuration frozen. Crucially, we uncover a dangerous disconnect between signal fidelity and clinical safety. Stationary ambient noise increased the Word Error Rate by a negligible 0.71 percentage points yet nearly doubled the rate of unsafe outputs. Our analysis reveals that minor acoustic perturbations can invert clinical meaning without substantially inflating error rates. Furthermore, we demonstrate a lightweight mitigation strategy that reduces safety degradation under noisy conditions without requiring model fine-tuning.
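
Why a small Word Error Rate can hide an inverted meaning can be seen with a toy example. The sketch below computes word-level WER via edit distance; the two sentences are invented for illustration and are not taken from the paper's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


clean = "the patient has no known drug allergies"
noisy = "the patient has known drug allergies"  # hypothetical transcript: "no" was dropped
print(f"WER = {word_error_rate(clean, noisy):.3f}")
```

A single deleted word out of seven gives a modest WER, yet the clinical statement is reversed, which is the kind of failure a WER-only evaluation would not flag.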

2606.05892 2026-06-05 eess.AS

VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization

VoCodec: 一种基于浊音驱动量化的低比特率可流式神经语音编解码器

Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Li-Rong Dai, Zhen-Hua Ling, Ji Wu

AI总结 提出VoCodec,通过浊音驱动量化(对浊音帧分配高比特率、清音帧分配低比特率)在1.1 kbps比特率下超越基线神经语音编解码器,相比均匀量化降低约27%比特率。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

神经语音编解码器是语音传输和存储的关键,但大多数编解码器在帧间使用均匀量化,无论内容如何都分配相同的比特率,浪费比特。我们提出VoCodec,一种低比特率可流式神经语音编解码器,采用浊音驱动量化,根据感知敏感性为浊音帧分配更高比特率,为清音帧分配更低比特率。VoCodec在完全因果的编码器-量化器-解码器神经编码框架中嵌入浊音检测器,对浊音帧使用残差标量-矢量量化,对清音帧使用简单标量量化。实验表明,在LibriTTS数据集上以16 kHz采样率,即使比特率低至1.1 kbps,VoCodec也优于基线神经语音编解码器。我们的进一步实验也证实,与均匀量化策略相比,引入浊音驱动量化可以有效降低约27%的比特率。

英文摘要

Neural speech codecs are key to speech transmission and storage, but most use uniform quantization across frames, allocating the same bitrate regardless of content and wasting bits. We propose VoCodec, a low-bitrate streamable neural speech codec with voicing-driven quantization that assigns higher bitrate to voiced frames and lower bitrate to unvoiced frames according to perceptual sensitivity. VoCodec embeds a voicing detector in a fully causal encoder-quantizer-decoder neural coding framework, using residual scalar-vector quantization for voiced frames and simple scalar quantization for unvoiced ones. Experiments show that on the LibriTTS dataset at a 16 kHz sampling rate, VoCodec outperforms baseline neural speech codecs even at a bitrate as low as 1.1 kbps. Our further experiments also confirm that introducing voicing-driven quantization can effectively reduce the bitrate by approximately 27% compared with a uniform quantization strategy.
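
The saving from content-dependent bit allocation follows from a weighted average of the per-frame rates. The sketch below shows that arithmetic only; the voiced fraction and the two rates are made-up numbers, not VoCodec's actual configuration.

```python
def average_bitrate_kbps(voiced_fraction: float, voiced_kbps: float, unvoiced_kbps: float) -> float:
    """Expected bitrate when voiced and unvoiced frames are coded at different rates."""
    return voiced_fraction * voiced_kbps + (1.0 - voiced_fraction) * unvoiced_kbps


def saving_vs_uniform(voiced_fraction: float, voiced_kbps: float, unvoiced_kbps: float) -> float:
    """Relative bitrate reduction against coding every frame at the voiced rate."""
    adaptive = average_bitrate_kbps(voiced_fraction, voiced_kbps, unvoiced_kbps)
    return 1.0 - adaptive / voiced_kbps


# Hypothetical numbers: half the frames voiced, 2.0 kbps vs 1.0 kbps.
print(average_bitrate_kbps(0.5, 2.0, 1.0))
print(saving_vs_uniform(0.5, 2.0, 1.0))
```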

2606.05889 2026-06-05 cs.SD cs.CL eess.AS

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

GLASS: 基于GRPO训练的LoRA用于零样本文本转语音中的声学风格引导

Jaehoon Kang, Yejin Lee, Kyuhong Shim

AI总结 提出GLASS框架,通过GRPO训练轻量LoRA适配器实现零样本自回归TTS中可组合的声学风格控制,无需风格标签即可从奖励中学习控制。

详情
AI中文摘要

我们提出GLASS,一个用于零样本自回归文本转语音(TTS)中可组合声学风格控制的框架,该框架从生成后奖励而非风格标签中学习控制。在零样本TTS中,说话人提示通常将说话人身份与语速、音高等韵律属性纠缠在一起,使得在不改变提示本身的情况下难以改变风格。GLASS将每个声学属性视为一个由奖励定义的控制方向。对于每个控制轴,GLASS冻结TTS主干,并使用组相对策略优化(GRPO)训练一个轻量级LoRA适配器,以语音令牌长度和平均F0作为风格奖励,以WER作为可懂度锚点。由于每个控制表示为LoRA权重更新,独立训练的适配器可以通过线性LoRA算术进行交换、插值和组合,而无需重新训练主干。在语速和音高控制上的实验显示了目标风格偏移,同时保持了自然度、说话人相似性和可懂度,并展示了跨独立训练适配器的平滑插值和多轴组合。

英文摘要

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
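
The linear LoRA arithmetic mentioned above can be sketched with plain matrices: each adapter contributes a low-rank update B·A, and independently trained updates are scaled and summed onto the frozen weight. This is a toy sketch with tiny hand-written matrices; the adapter names and scaling weights are illustrative assumptions, not the GLASS implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose(base: Matrix, adapters: Sequence[Tuple[Matrix, Matrix]], scales: Sequence[float]) -> Matrix:
    """Return base + sum_i scales[i] * (B_i @ A_i); the base matrix is left unmodified."""
    merged = [row[:] for row in base]
    for (b, a), scale in zip(adapters, scales):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]               # frozen backbone weight (toy)
rate_adapter = ([[1.0], [0.0]], [[0.0, 1.0]])  # rank-1 (B, A) pair, hypothetical
pitch_adapter = ([[0.0], [1.0]], [[1.0, 0.0]])

# Interpolate one axis (scale 0.5) while applying the other fully (scale 1.0).
print(compose(base, [rate_adapter, pitch_adapter], [0.5, 1.0]))
```

Setting a scale to zero removes that adapter's effect, which is what makes swapping and interpolation possible without touching the backbone.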

2606.05876 2026-06-05 eess.AS

An Ultra-Low-Bitrate Neural Speech Codec with Plain-to-Pseudo Synergistic Vector Quantization

一种采用朴素到伪协同矢量量化的超低比特率神经语音编解码器

Xiao-Hang Jiang, Yang Ai, Fei Liu, Rui-Chen Zheng, Jian-Qing Gao, Zhen-Hua Ling, Ji Wu

AI总结 提出P2PSynCodec,通过朴素到伪协同矢量量化器(P2PSVQ)结合一个朴素VQ和多个伪VQ,在0.5 kbps下实现与2.0 kbps竞品相当的语音重建质量。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

大多数神经语音编解码器使用残差矢量量化(RVQ),其中后续VQ贡献较小但消耗相同比特率,导致效率低下。我们提出P2PSynCodec,一种采用朴素到伪协同矢量量化器(P2PSVQ)的超低比特率神经语音编解码器。P2PSVQ由一个朴素VQ和多个伪VQ组成。朴素VQ通过量化产生基本令牌,而伪VQ通过神经预测生成辅助令牌且不产生传输比特率。因此,语音从朴素VQ令牌与预测的伪VQ令牌一起解码,大大降低了比特率。实验表明,P2PSynCodec在仅0.5 kbps的比特率下实现了与2.0 kbps竞品相当的语音重建质量,展示了超低比特率语音编码的高效率。

英文摘要

Most neural speech codecs use residual vector quantization (RVQ), in which later VQs contribute less but consume the same bitrate, leading to inefficiency. We propose P2PSynCodec, an ultra-low-bitrate neural speech codec with a plain-to-pseudo synergistic vector quantizer (P2PSVQ). P2PSVQ consists of one plain VQ and multiple pseudo VQs. The plain VQ produces basic tokens by quantization, while the pseudo VQs generate auxiliary tokens by neural prediction and incur zero transmitted bitrate. Thus, speech is decoded from the plain-VQ tokens together with predicted pseudo-VQ tokens, greatly reducing bitrate. Experiments show that P2PSynCodec achieves speech reconstruction quality comparable to competing codecs at 2.0 kbps while operating at only 0.5 kbps, demonstrating high efficiency for ultra-low-bitrate speech coding.
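
The bitrate argument can be made concrete: only quantizers whose indices are transmitted cost bits, each costing log2(codebook size) bits per frame. The sketch below assumes a 50 Hz frame rate and 1024-entry codebooks purely for illustration; these are not stated in the abstract.

```python
import math
from typing import Sequence


def transmitted_bitrate_bps(frame_rate_hz: float, transmitted_codebook_sizes: Sequence[int]) -> float:
    """Bits per second for the quantizers whose token indices are actually sent."""
    bits_per_frame = sum(math.log2(size) for size in transmitted_codebook_sizes)
    return frame_rate_hz * bits_per_frame


# Hypothetical setup: 50 frames per second, 1024-entry codebooks.
rvq_4_layers = transmitted_bitrate_bps(50, [1024] * 4)  # every layer is transmitted
plain_only = transmitted_bitrate_bps(50, [1024])        # predicted tokens cost no bits
print(rvq_4_layers, plain_only)
```

Under these assumed settings, tokens that are predicted at the decoder rather than transmitted leave the bitrate at that of the single plain quantizer.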

2606.05852 2026-06-05 cs.SD cs.AI eess.AS

UniVoice: A Unified Model for Speech and Singing Voice Generation

UniVoice: 一种用于语音和歌声生成的统一模型

Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

AI总结 提出UniVoice,一种基于条件流匹配的统一语音和歌声生成框架,通过将条件分解为内容、旋律和音色,并引入空旋律标记,实现单一模型同时生成自然语音和可控歌声。

详情
Comments
9 pages, 2 figures
AI中文摘要

文本到语音(TTS)和歌声合成(SVS)都旨在从符号输入生成人类声音音频,但它们对生成过程提出了不同的要求。语音生成依赖于灵活的、语言驱动的韵律,而歌声生成则需要明确的旋律控制和准确的节奏对齐。这种不匹配使得训练一个既能生成自然语音又能生成可控歌声的单一模型具有挑战性,因为与旋律相关的条件应该强烈约束歌声,但不应限制语音韵律。我们提出了UniVoice,一种基于条件流匹配的统一语音和歌声生成框架。UniVoice没有使用单一的未分化条件表示,而是将条件分解为内容、旋律和音色,这些由适合模态的编码器编码,并由共享的扩散变换器(DiT)主干网络使用。对于歌声,旋律条件由MIDI音符序列表示;对于语音,它被替换为学习的空旋律标记,使模型能够从语言和声学上下文中推断韵律。这种设计保留了歌声的显式旋律控制,同时避免了对语音施加旋律约束的需要。我们进一步将空旋律标记分析为条件流中旋律边缘化的近似。在3万小时语音和3.5万小时歌声数据上训练,UniVoice在语音上实现了5.26%的音素错误率(PER),与专用TTS系统如F5-TTS(5.21%)和CosyVoice3(5.30%)相当。在歌声生成上,UniVoice实现了16.22%的PER,优于统一基线Vevo1.5(24.72%)。

英文摘要

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).

2606.05851 2026-06-05 eess.SY cs.SY

Mixed Potential Approach to Convergence of Nonlinear RLC Circuits with Memristors

含忆阻器的非线性RLC电路收敛性的混合势方法

Mauro Di Marco, Mauro Forti, Luca Pancioni, Giacomo Innocenti, Alberto Tesi

AI总结 本文通过引入混合势函数,利用通量-电荷分析方法,证明了含忆阻器的非线性RLC电路在电容与电感平衡条件下的收敛性,并推广了无忆阻器电路的相关结果。

详情
AI中文摘要

本文考虑一大类非线性电路,称为RLCM,包含所有四种基本电路元件,即电阻、电感、电容和忆阻器。伴随论文[1]引入了RLCM电路的混合势,推广了Brayton和Moser对无忆阻器电路的结果。本文通过混合势证明了RLCM电路收敛性的系统类Lyapunov结果。这些结果基于RLCM电路在通量-电荷域具有完整变量集的基本假设,并且大致要求电容与电感之间存在定量估计的平衡。收敛性结果对电路参数变化具有鲁棒性,并且包括忆阻器电路具有多个稳定平衡点的情况,这对于实现内容可寻址存储器(CAM)等应用具有重要意义。这些结果将先前适用于无忆阻器电路或无电感忆阻器电路的结果推广到包含所有四种基本电路元件的电路。主要证明采用通量-电荷分析方法(FCAM)在通量-电荷域分析RLCM电路。

英文摘要

The paper considers a large class of nonlinear circuits, termed RLCM, containing all four basic circuit elements, i.e., resistors, inductors, capacitors and memristors. A companion paper [1] has introduced a mixed potential for RLCM circuits generalizing that found by Brayton and Moser for circuits without memristors. In this paper, systematic Lyapunov-like results on convergence of RLCM circuits are proved by means of the mixed potential. These hold under the basic assumption that an RLCM circuit has a complete set of variables in the flux-charge domain and they require, roughly speaking, that there is a balance, which is quantitatively estimated, between capacitors and inductors. The convergence results are robust with respect to circuit parameter variations and they include cases where the memristor circuits possess multiple stable equilibrium points, which is important, for instance, for implementing content addressable memories (CAMs). The results extend previous results, which pertain to circuits without memristors or to memristor circuits without inductors, to circuits possessing all four basic circuit elements. The main proofs are conducted by using the flux-charge analysis method (FCAM) to analyze RLCM circuits in the flux-charge domain.

2606.05846 2026-06-05 cs.CL eess.AS

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR:将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

AI总结 通过模型合并和领域泛化方法,研究从有限语言对中学到的代码切换能力能否泛化到未见语言对,实验表明双语CS-ASR模型对未见语言对有一定泛化能力但有限。

详情
Comments
ICML 2026 Workshop on Machine Learning for Audio
AI中文摘要

自动语音识别(ASR)已成为人机交互的关键技术。然而,由于跨多种语言对的代码切换(CS)语音资源严重稀缺,代码切换ASR(CS-ASR)仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而,这些方法面临固有的可扩展性限制,因为对CS的支持必须针对语言对单独开发,而语言对的数量随支持的语言数量呈组合增长。在这项工作中,我们研究通过模型合并和领域泛化方法,从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明,合并的双语CS-ASR模型对未见语言对有一定程度的泛化,表明双语CS能力在语言对之间的迁移有限。

英文摘要

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

2606.05840 2026-06-05 eess.SY cs.RO cs.SY

Amortized Nonlinear Model Predictive Control

摊销非线性模型预测控制

Francesco Pillitteri, Alberto Bemporad

AI总结 针对输入仿射非线性系统,提出一种基于状态依赖二次规划的单网络残差校正架构,通过可微内点层保证约束满足,实现实时非线性模型预测控制,在机械臂跟踪任务中取得数量级加速。

详情
Comments
6 pages
AI中文摘要

非线性模型预测控制需要在每个采样时刻实时求解一个约束非线性规划(NLP),这是一个计算瓶颈,限制了在资源受限硬件或高采样率下的部署。我们针对输入仿射非线性系统这一广泛类别解决了这一挑战,证明了最优控制动作可以通过一个状态依赖的二次规划(QP)来近似,其成本参数取决于当前状态和参考。我们提出了一种单网络残差校正架构:一个状态依赖的解析基线提供初始QP参数,网络仅学习匹配完整NLP解所需的校正;QP通过一个可微内点层求解,保证了第一个控制动作的约束满足。该网络使用由NLP求解器生成的数据进行离线训练,采用结合监督模仿和KKT残差惩罚的混合损失。我们在一个具有笛卡尔末端执行器跟踪的三连杆平面机械臂上验证了该方法,展示了相比NLP求解器数量级的加速,同时保持了可比的跟踪性能。

英文摘要

Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems by showing that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.

2606.05825 2026-06-05 physics.soc-ph cs.SY eess.SY

On Leadership Emergence in Opinion Dynamics on Social Networks

社交网络舆论动态中的领导力涌现

Martina Alutto, Lorenzo Zino, Karl H. Johansson, Angela Fontan

AI总结 通过引入耦合动力学模型扩展Friedkin-Johnsen框架,研究社交网络中个体意见与领导力的共同演化,分析领导力涌现的条件。

详情
Comments
9 pages, 4 figures
AI中文摘要

社会群体中的领导力通过互动和意见交换动态涌现。实证证据表明,表达强烈意见的个体倾向于获得影响力,而持续的领导力关键取决于与周围社会背景保持一致。受这些观察启发,我们引入了一个耦合动力学模型,描述网络化群体中意见和领导力的同时演化。扩展Friedkin-Johnsen框架,我们将领导力表示为对社会影响的时间变化的敏感性,该敏感性根据博弈论机制演化,与社会心理学证据一致。在此设置中,个体通过表达果断但社会一致的意见来加强其领导力,而与集体状态的不一致则导致影响力丧失。我们分析了耦合动力学,并建立了充分条件来识别社交网络中哪些个体必然成为领导者,哪些个体充当追随者。

英文摘要

Leadership in social groups emerges dynamically through interaction and opinion exchange. Empirical evidence indicates that individuals expressing strong opinions tend to gain influence, while sustained leadership critically depends on maintaining alignment with the surrounding social context. Motivated by these observations, we introduce a coupled dynamical model describing the simultaneous evolution of opinions and leadership in a networked population. Extending the Friedkin-Johnsen framework, we represent leadership as a time-varying susceptibility to social influence, which evolves according to a game-theoretic mechanism, consistent with social psychology evidence. Within this setting, agents strengthen their leadership by expressing decisive yet socially coherent opinions, whereas misalignment with the collective state results in a loss of influence. We analyze the coupled dynamics and establish sufficient conditions to identify which agents necessarily emerge as leaders and which act as followers in the social network.
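
For orientation, the standard Friedkin-Johnsen opinion update that the paper extends can be sketched as below. Susceptibilities are held fixed here; the paper's game-theoretic rule for evolving them is not reproduced, and the two-agent network is an invented example.

```python
from typing import List, Sequence


def fj_step(opinions: Sequence[float], prejudices: Sequence[float],
            weights: Sequence[Sequence[float]], susceptibility: Sequence[float]) -> List[float]:
    """One Friedkin-Johnsen update: blend the neighbours' weighted average with one's own prejudice."""
    n = len(opinions)
    return [
        susceptibility[i] * sum(weights[i][j] * opinions[j] for j in range(n))
        + (1.0 - susceptibility[i]) * prejudices[i]
        for i in range(n)
    ]


def simulate(prejudices: Sequence[float], weights: Sequence[Sequence[float]],
             susceptibility: Sequence[float], steps: int = 200) -> List[float]:
    """Iterate the update starting from the prejudices (initial opinions)."""
    x = list(prejudices)
    for _ in range(steps):
        x = fj_step(x, prejudices, weights, susceptibility)
    return x


weights = [[0.5, 0.5], [0.5, 0.5]]  # row-stochastic influence matrix (toy)
# Agent 0 has zero susceptibility (never moves); agent 1 is fully susceptible.
print(simulate([0.0, 1.0], weights, susceptibility=[0.0, 1.0]))
```

In this toy run the fully susceptible agent is drawn to the opinion of the agent with zero susceptibility, which loosely mirrors the leader and follower roles the abstract describes.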