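arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.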

2606.03940 2026-06-03 eess.IV cs.CV cs.LG cs.RO

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

Dan Jacobellis, Neeraja J. Yadwadkar

AI summary: Proposes SEAOTTER, a framework that pairs a sensor-embedded autoencoder with a learnable JPEG transcode. At a 200:1 compression ratio it encodes 7 times faster and decodes 3.5 times faster than AVIF and raises ImageNet top-1 accuracy by 8%, while remaining JPEG-compatible.

Abstract

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter .
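
To put the 200:1 figure in perspective, the arithmetic below is a minimal sketch that assumes the ratio is measured against uncompressed 8-bit RGB (24 bits per pixel). The abstract does not state the reference format, and the function names and the 1920 x 1080 frame size are illustrative only.

```python
def bits_per_pixel(compression_ratio, raw_bits_per_pixel=24):
    """Average bits spent per pixel after compression.

    raw_bits_per_pixel=24 assumes uncompressed 8-bit RGB (an assumption,
    not a detail taken from the abstract).
    """
    return raw_bits_per_pixel / compression_ratio


def compressed_size_bytes(width, height, compression_ratio, raw_bits_per_pixel=24):
    """Compressed size in bytes of a width x height frame at the given ratio."""
    raw_bits = width * height * raw_bits_per_pixel
    return raw_bits / compression_ratio / 8


if __name__ == "__main__":
    print(bits_per_pixel(200))                     # bits per pixel at 200:1
    print(compressed_size_bytes(1920, 1080, 200))  # bytes for one 1920 x 1080 frame
```

Under that assumption, 200:1 leaves 0.12 bits per pixel, so a 1920 x 1080 frame shrinks from about 6.2 MB to roughly 31 KB.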

2606.03832 2026-06-03 eess.AS

In-the-Loop Training of Deep Feedback Cancellation for Hearing Aids

Svantje Voit, Simon Doclo

AI summary: Addresses acoustic feedback, which limits the maximum gain of hearing aids, with an in-the-loop-trained deep feedback cancellation method. A two-stage training strategy keeps the model stable at high gains; experiments on measured feedback paths show it matches open-loop training at small gains, where both beat adaptive filters, and clearly outperforms open-loop training at high gains.

Abstract

Acoustic feedback limits the maximum gain in hearing aids. In addition to several approaches based on adaptive filtering, recently a deep-neural-network-based feedback cancellation (DFC) approach has been proposed, which is trained via an open-loop framework. Since open-loop-trained DFC (DFC-OL) can become unstable during inference at high gains, in this paper we propose an in-the-loop-trained DFC (DFC-IL) that integrates the DFC directly into the optimisation loop. This allows the model to be exposed to unstable conditions during training. A two-stage training strategy involving pre-training on stable systems and fine-tuning on a wider gain range enables DFC-IL to learn robust howling reduction. Experimental results on measured feedback paths demonstrate that in scenarios with small gains, the proposed DFC-IL performs similarly to DFC-OL, and both exceed the performance of adaptive filters. In scenarios with high amplification gains, DFC-IL clearly outperforms DFC-OL by maintaining system stability.

2606.03830 2026-06-03 eess.SP

Constrained Pinching Antenna Array Design for Sum-Rate Maximization in Multi-User PASS

Minghao Jin, Anna Li, Tianwei Hou, Qiang Ni, Arumugam Nallanathan

AI summary: For multi-user pinching antenna systems (PASS), proposes a constrained pinching antenna array (C-PAA) scheme that maximizes the sum rate by jointly optimizing the array-center position and the fine-grained antenna distribution, and derives a closed-form approximate solution to reduce complexity.

Abstract

Pinching antenna systems (PASS) have recently emerged as a promising architecture for flexible indoor wireless communications. However, most existing pinching antenna (PA) array designs for multi-user PASS either offer limited beam adaptation accuracy or require prohibitively high deployment cost. In this paper, we investigate a more practical constrained pinching antenna array (C-PAA)-assisted downlink PASS, where multiple PAs are grouped into a movable array and can be finely adjusted within the array at the wavelength scale. To improve the system spectral efficiency, a sum-rate maximization problem is formulated by jointly considering the array-center position and the fine-grained antenna distribution within the C-PAA. First, the structural properties of the C-PAA are characterized, and an explicit upper bound on the array aperture is derived. Then, tractable approximations for the effective channel gain and the achievable user rate are developed. Furthermore, the optimization problem of the multi-user sum-rate is analyzed, where the system sum-rate function is shown to exhibit a favorable unimodal behavior under practically relevant conditions, which enables an efficient one-dimensional search for the optimal C-PAA position. To further reduce the computational complexity, a closed-form approximate solution for the near-optimal array-center position is derived. Numerical results verify the accuracy of the developed analysis and demonstrate that the proposed C-PAA scheme closely approaches the ideal upper bound and significantly outperforms conventional fixed-spacing and existing PA array benchmarks.
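
The abstract notes that a unimodal sum-rate function permits an efficient one-dimensional search over the array-center position. As a generic illustration (not the authors' algorithm), a golden-section search can maximize any unimodal function on an interval. `toy_sum_rate` and the search interval below are hypothetical stand-ins for the paper's sum-rate expression.

```python
import math


def golden_section_maximize(f, lo, hi, tol=1e-6):
    """Return x in [lo, hi] that approximately maximizes a unimodal f."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # Maximum lies in [a, d]; the old c becomes the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Maximum lies in [c, b]; the old d becomes the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def toy_sum_rate(x):
    """Hypothetical unimodal objective peaking at x = 3.0 (not from the paper)."""
    return math.log2(1 + 10 / (1 + (x - 3.0) ** 2))


if __name__ == "__main__":
    print(golden_section_maximize(toy_sum_rate, 0.0, 10.0))
```

Each iteration shrinks the interval by a constant factor and needs only one new function evaluation, which is why unimodality makes the position search cheap.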

2606.03747 2026-06-03 eess.AS eess.SP

Stable Hybrid Cross-Attention Fusion for Audio-Visual Event Recognition

Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar

AI summary: Proposes a hybrid cross-attention fusion framework that combines VideoMAE and AST with FiLM-based audio conditioning, bidirectional cross-attention fusion, and multimodal Transformer encoding, reaching a best validation accuracy of 91.74% and a test accuracy of 83.85% on the AVE dataset.

Comments: 6 pages, 4 figures

Abstract

Audio-Visual Event Recognition (AVER) is essential for intelligent urban monitoring systems, where robust multimodal understanding of complex environments is required. This paper proposes a stable hybrid cross-attention fusion framework for audio-visual event recognition in smart urban environments. The proposed architecture combines pretrained Video Masked Autoencoder (VideoMAE) and Audio Spectrogram Transformer (AST) representations with FiLM-based audio conditioning, bidirectional cross-attention fusion, multimodal Transformer encoding, and modality-temporal attention. To improve computational efficiency and training stability, frozen pretrained backbones and cached feature extraction are employed. Extensive experiments on the AVE dataset show that the proposed framework achieves the highest average performance among the evaluated unimodal and multimodal baselines across multiple evaluation metrics, obtaining a best validation accuracy of 91.74% and a test accuracy of 83.85 ± 1.40% over five independent runs. The results indicate that the proposed hybrid fusion strategy effectively captures complementary audio-visual information and provides robust multimodal representation learning for challenging real-world urban monitoring scenarios.
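
FiLM stands for feature-wise linear modulation: a conditioning signal supplies a per-channel scale and shift that are applied to another feature map. The sketch below is a minimal pure-Python illustration, assuming the scale and shift come from the audio branch and modulate per-frame visual features. The function name, shapes, and toy numbers are hypothetical, and in a trained model the scale and shift would be produced by learned projections.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma[c] * x[t][c] + beta[c].

    features: list of T time steps, each a list of C channel values.
    gamma, beta: per-channel scale and shift (length C), assumed to be
    predicted from the conditioning modality.
    """
    if not all(len(step) == len(gamma) == len(beta) for step in features):
        raise ValueError("channel dimensions must match")
    return [
        [g * x + b for x, g, b in zip(step, gamma, beta)]
        for step in features
    ]


if __name__ == "__main__":
    visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # T=2 steps, C=3 channels
    gamma = [2.0, 0.0, 1.0]   # hypothetical audio-derived scales
    beta = [0.5, 1.0, -1.0]   # hypothetical audio-derived shifts
    print(film(visual, gamma, beta))  # [[2.5, 1.0, 2.0], [8.5, 1.0, 5.0]]
```

A scale of zero, as in the second channel, lets the conditioning signal suppress a feature entirely and replace it with the shift.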

2606.03673 2026-06-03 eess.SP

Chasing Lightning: Detecting, Characterizing, and Identifying a Powerful Space-Based GNSS Interference Source

Zachary L. Clements, Argyris Kriezis, Todd E. Humphreys

AI summary: Using data collected from 2019 to 2026 by a network of terrestrial GNSS reference stations, develops a received-power-based detection framework, details the spatial, temporal, and spectral patterns of the interference events, and blends received-power and time-difference-of-arrival measurements to identify the source as a constellation of Russian early warning satellites in Molniya ("lightning") orbits.

Comments: Submitted for review to the Institute of Navigation journal NAVIGATION

Abstract

This paper analyzes and identifies a space-based Global Navigation Satellite System (GNSS) interference source that has caused scores of powerful transient wide-area interference events over continental Europe, Greenland, and Canada since 2019. While terrestrial or near-terrestrial sources are primarily responsible for the recent uptick in GNSS interference worldwide, space-based interferers are of special concern given their potential for vast geographic reach and their portent of a qualitative escalation in GNSS interference. Based on data collected between 2019 and 2026 from a network of terrestrial GNSS reference stations, this paper (1) develops a received-power-based detection framework; (2) details the spatial, temporal, and spectral patterns of wide-area interference events caused by the source; (3) presents and analyzes identification techniques that blend received-power and time-difference-of-arrival measurements; and (4) applies these techniques to confidently identify the GNSS interference source as a constellation of Russian early warning satellites in Molniya ("lightning") orbits.

2606.03531 2026-06-03 eess.SP

Voxel-CKM: Voxelized Radio Frequency Radiance Fields for Fast and Few-Shot CKM Construction

Hanlei Li, Guangyi Zhang, Kequan Zhou, Yunlong Cai, Guanding Yu

AI summary: Proposes Voxel-CKM, a framework that builds channel knowledge maps quickly and from few samples using voxelized radio frequency radiance fields and vector-matrix decomposition.

Abstract

Channel knowledge maps (CKMs) are designed to predict channel state information (CSI) from user locations, thereby enabling low-overhead CSI acquisition. However, existing CKM construction methods often require hours-to-days of training time and dense measurements, resulting in substantial deployment cost. In this paper, we propose Voxel-CKM, a novel voxelized radio frequency (RF) radiance field framework for fast and few-shot CKM construction. The core idea is to replace implicit neural representations with explicit voxel grids to efficiently capture the spatial variation of wireless channels. Building upon this, we further introduce a compact vector-matrix (VM) decomposition to parameterize these voxel grids using a small set of matrices and vectors, which significantly accelerates convergence and facilitates fast CKM construction. To enable few-shot learning, we incorporate a transmitter prior as an inductive bias to guide the learning process under sparse measurements. Additionally, a total-variation (TV) regularization loss is proposed to mitigate overfitting and stabilize optimization. Experiments show that Voxel-CKM substantially accelerates training convergence and improves performance in the few-shot regime.
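
A total-variation penalty discourages abrupt changes between neighboring voxels, which is one way to curb overfitting when measurements are sparse. The sketch below computes an anisotropic L1 variant on a nested-list voxel grid. The abstract does not specify which TV form Voxel-CKM uses, so the choice of norm and the function name are assumptions.

```python
def total_variation_3d(grid):
    """Sum of absolute differences between axis-adjacent voxels.

    grid: nested list indexed as grid[x][y][z] holding scalar voxel values.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    tv = 0.0
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                v = grid[x][y][z]
                # Compare each voxel with its forward neighbor along every axis.
                if x + 1 < nx:
                    tv += abs(grid[x + 1][y][z] - v)
                if y + 1 < ny:
                    tv += abs(grid[x][y + 1][z] - v)
                if z + 1 < nz:
                    tv += abs(grid[x][y][z + 1] - v)
    return tv
```

A constant grid scores zero, while a single outlier voxel is penalized once per neighbor, so adding this term to a fitting loss favors smooth fields.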

2606.03468 2026-06-03 eess.IV cs.MM cs.NI

When BBR Meets Live Streaming

Xu Yan, Tong Li, Bo Wu, Cheng Luo, Jiuxiang Zhu, Laizhong Cui

AI summary: To address the problems caused by BBR's inaccurate bandwidth estimation in live streaming, proposes BBR-Copilot, an auxiliary component that proactively sends extra data to generate accurate bandwidth samples, improving BBR's live-streaming performance.

Abstract

Recently, industrial pioneers like Amazon, Tencent, ByteDance, and Huawei have been adopting BBR as their congestion control algorithm for live-streaming applications, including TikTok Live. However, BBR, originally crafted for bulk data transmission, faces multiple challenges in live-streaming scenarios. In this paper, we first explore two key issues associated with BBR due to inaccurate bandwidth estimation in live-streaming scenarios: (i) BBR cannot easily exit its startup phase, resulting in a fierce self-inflicted loss. (ii) BBR sends data at a lower rate than the available bandwidth during its stable phase. We then propose BBR-Copilot, an auxiliary congestion control component that cooperates with BBR, making BBR better adapt to live-streaming scenarios. BBR-Copilot allows for proactively generating accurate bandwidth measurement samples by smartly creating and sending extra data. We implement the BBR-Copilot prototype upon QUIC and evaluate it via testbed. Experimental evaluation results show that BBR-Copilot effectively enhances BBR's performance in live-streaming scenarios.
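
BBR's bandwidth estimate is, at its core, a windowed maximum over recent delivery-rate samples. The sketch below illustrates that filter only. It is not BBR-Copilot or a full BBR implementation, and the class name, the default window of 10 rounds, and the byte counts in the usage note are illustrative assumptions.

```python
from collections import deque


class WindowedMaxBandwidth:
    """Keep the maximum delivery-rate sample seen in the last `window` rounds."""

    def __init__(self, window=10):
        self.window = window
        self.samples = deque()  # (round_number, rate) pairs, rates decreasing

    def update(self, round_number, delivered_bytes, interval_s):
        rate = delivered_bytes / interval_s  # delivery rate in bytes/s
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= round_number - self.window:
            self.samples.popleft()
        # Drop older samples that the new one dominates.
        while self.samples and self.samples[-1][1] <= rate:
            self.samples.pop()
        self.samples.append((round_number, rate))
        return self.samples[0][1]  # current bandwidth estimate
```

If an application-limited stream only ever hands the sender, say, 500 KB per second, every sample and therefore the maximum stays near 500 KB/s regardless of path capacity. That is consistent with the underestimation the abstract describes, and it suggests why sending extra data can yield samples closer to the available bandwidth.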

2606.03455 2026-06-03 eess.AS cs.SD

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Wenxi Chen, Dongya Jia, Yushen Chen, Zhikang Niu, Yuzhe Liang, Xiquan Li, Ruiqi Yan, Ziyang Ma, Guanrou Yang, Sanyuan Chen, Yue Wang, Zhuo Chen, Kai Yu, Xie Chen

AI summary: Proposes WavTTS, the first raw waveform generative TTS model, built on flow matching with a Diffusion Transformer. It models waveforms directly through a simple patchification strategy with multi-scale mel-spectrogram supervision and approaches the performance of latent-space generative models in zero-shot TTS.

Abstract

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon the flow matching with Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.

2606.03370 2026-06-03 eess.IV

SMAC: Spatial-Modal Joint Modeling and Adaptive Representation Collapse for Multimodal Object Tracking

Meijing Gao, Qitai Sun, Huanyu Sun, Bingxuan Yang, Bingzhou Sun, Xu Chen, Yonghao Yan, Yuxuan Yang

AI summary: Targets multimodal multi-object tracking under complex illumination, where joint modeling of spatial and modal features is insufficient and fixed fusion strategies adapt poorly. Proposes a multimodal tracking framework based on spatial-modal convolution fusion and distillation prompts that achieves adaptive fusion through decoupled 3D convolution, amplitude-phase decomposition, and a representation collapse network, with leading performance on the UniRTL dataset.

Comments: 12 pages, 16 figures. Code and pretrained models are available at https://github.com/QitaiSun/SMAC

Abstract

Multimodal multi-object tracking (MOT) under complex illumination remains challenging due to insufficient joint modeling of spatial and modal features and the limited adaptability of fixed fusion strategies. To address these issues, this paper proposes a spatial-modal convolution fusion and distillation-prompt-based multimodal MOT framework. A spatial-modal fusion backbone is first constructed, where a Basic module performs spatial feature extraction and modal interaction via decoupled 3D convolution, while a Mixed module models nonlinear cross-modal correlations through amplitude-phase decomposition. In addition, a representation collapse network is designed for adaptive multimodal fusion. A Distillation Prompt Guidance (DPG) module generates dynamic modal weights under teacher supervision, and a Global Modal Difference Aggregation (GMDA) module preserves discriminative information during multimodal representation collapse. Extensive experiments on the UniRTL dataset demonstrate the effectiveness of the proposed method. The proposed tracker achieves 63.31 HOTA and 79.21 MOTA on the RNT modality, outperforming several state-of-the-art methods while maintaining favorable inference efficiency. The source code and pretrained models are publicly available at https://github.com/QitaiSun/SMAC.

2606.03337 2026-06-03 eess.SP

Node-Oriented Proactive Spectral Modulation: A Unified Fractional Framework for Graph Signal Denoising

Manjun Cui, Zhichao Zhang, Yangfan He

AI summary: Proposes a node-oriented fractional filtering (NOFF) framework whose low-rank variant (LRNOFF) unifies localized spatial adaptability with proactive spectral modulation, addressing spectral rigidity and overfitting in graph signal denoising.

Abstract

Graph signal denoising is a fundamental task in graph signal processing. While the node-oriented filtering approach enhances spatial adaptability, it suffers from spectral rigidity due to its reliance on the graph Fourier transform. Conversely, emerging fractional-domain transforms provide crucial spectral flexibility but are fundamentally limited by their globally shared filtering paradigm, failing to accommodate localized topological variations. To bridge this gap, this paper proposes a generalized node-oriented fractional filtering (NOFF) framework that seamlessly integrates localized spatial adaptability with proactive spectral modulation across various fractional transforms. However, straightforwardly assigning independent full-rank filters to all vertices incurs a prohibitive parameter space, leading to severe overfitting on random noise. To mitigate this, we introduce the low-rank NOFF (LRNOFF) architecture. By imposing a strict low-rank constraint, LRNOFF inherently acts as a powerful implicit regularizer, preventing noise memorization and ensuring the extraction of robust spectral bases. Furthermore, we develop an efficient computational implementation termed LRNOFF-Fast, which drastically reduces computational and memory overhead while preserving theoretical optimality. Experiments on real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance.

2606.03116 2026-06-03 eess.AS cs.AI cs.SD

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen

AI summary: Targets instruction-guided audio generation, where complex instructions are hard to decouple and evaluation lacks interpretability and fine-grained attribute matching. Proposes a dynamic rubric-based evaluation paradigm that adaptively decomposes audio captions into verifiable binary rubric items, builds a bilingual benchmark of 7,920 samples and a 105K-sample training corpus, and trains a dedicated evaluator with SFT and GRPO, yielding significant gains in zero-shot alignment detection and in instruction alignment for downstream reinforcement learning.

Abstract

The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.

2606.03013 2026-06-03 eess.SP

Fault-Aware Design for Reconfigurable Holographic Surface-Aided ISAC Systems

Lu Wang, Mohamadreza Delbari, Gui Zhou, Luis F. Abanto-Leon, Matthias Hollick, Vahid Jamali

AI summary: Addresses hardware faults in reconfigurable holographic surface (RHS)-aided integrated sensing and communication (ISAC) systems with a block coordinate descent-based optimization that minimizes the misspecified Cramer-Rao bound (MCRB) subject to constraints including the signal-to-interference-and-noise ratio (SINR), giving a fault-aware RHS design with an average performance gain of 13.7%.

Comments: accepted by IEEE PIMRC 2026

Abstract

Reconfigurable holographic surface (RHS)-aided integrated sensing and communication (ISAC) systems hold great promise for achieving both sensing and communication with low hardware costs and high energy efficiency. However, existing works largely overlook practical hardware impairments in RHSs, particularly faulty RHS elements with uncontrollable amplitudes, which degrade system performance if left unaddressed. This work aims to fill the gap by i) quantifying the impact of faulty RHS elements on ISAC performance and ii) optimizing the functional RHS elements to preserve the ISAC performance. Specifically, we derive the misspecified Cramer-Rao bound (MCRB) for sensing and the signal-to-interference-and-noise ratio (SINR) for communication to measure the performance loss caused by faulty elements. We then formulate an optimization problem that minimizes MCRB, subject to constraints on SINR, transmit power budget, and RHS amplitude. The high non-convexity of the formulated problem poses a significant challenge, which we address by reformulating and proposing a block coordinate descent-based solution incorporating majorization-minimization and successive convex approximation techniques. Simulation results verify that the proposed approach achieves an average 13.7% performance gain compared to the fault-unaware benchmark.

2606.02961 2026-06-03 eess.IV

AtlasGS: Brain MRI Spatial Resolution Harmonization With Shared Gaussian Geometry

Yifan Gao, Peiran Xu, Yimeng He, Haoran Li, Ziyang Long, Yufeng Wang, Ju Dong Yang, Debiao Li

AI summary: Proposes a Gaussian splatting-based shared geometry framework that uses two-stage training for isotropic super-resolution reconstruction of multimodal MRI, reaching state-of-the-art performance on multiple datasets.

Abstract

A Gaussian Splatting (GS)-based shared geometry framework adopts a two-stage training strategy, in which an explicit, subject-specific Gaussian scaffold encoding anatomical geometry is first learned from the isotropic structural scan and then reused to fit appearance for target modalities acquired with sparse slices. Experiments on the UK Biobank, GBM, and ABCD datasets for through-plane super-resolution across multiple modalities (T2-weighted, FLAIR, DWI, ASL), degradation factors ($\times 3$, $\times 5$, $\times 7$), and pathological abnormalities (glioblastoma) demonstrate state-of-the-art reconstruction fidelity. The shared Gaussian geometry enables arbitrary-view generation for target modalities with strong structural consistency and further shows potential for self-supervised in-plane super-resolution. This work establishes explicit geometry-guided representations as a novel, flexible, and interpretable pathway toward retrospective multi-contrast MRI harmonization and reliable clinical reference construction. Source code is available at: https://github.com/yfgao76/AtlasGS

2606.02913 2026-06-03 eess.AS cs.SD

A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

AI summary: Compares generative and discriminative deep learning methods for speech enhancement, analyzing robustness, complexity, and hallucination characteristics under high and low signal-to-noise ratios and under matched and mismatched training scenarios.

Abstract

In this study, we conduct a comprehensive comparative analysis of generative and discriminative deep learning-based speech enhancement methods, specifically in noise reduction tasks. Our investigation focuses on evaluating their effectiveness under high and low signal-to-noise ratio conditions, considering both matched and mismatched training scenarios. We further investigate the impact of training data volume and model convergence speed, and interpret the performance differences in terms of objective results for the considered training paradigms. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To further strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights derived from this study provide empirical evidence to assist researchers and practitioners in understanding whether the perceptual gains of different approaches justify their computational cost in practical applications.

2606.02906 2026-06-03 eess.IV cs.CV

Depth from Dual Differential Defocus and Stereo Consensus

Junjie Luo, Wei Xu, Dylan Chu, Emma Alexander, Qi Guo

AI summary: Proposes the D^3S Consensus algorithm, which fuses depth from defocus with stereo for highly accurate depth estimation beyond the depth of field. It selects reliable predictions through consensus between physically independent cues and reaches a comparable working range with a smaller baseline.

Abstract

We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depths using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with a 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only a 4 mm baseline and a 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This surpasses the reported accuracy of certain commercially available stereo cameras with much larger form factors.

2606.02891 2026-06-03 eess.SP

Global Unknown Estimation: A Statistical Framework for Wireless Distributed Learning

Yicheng Qu, Ali Bereyhi, Ben Liang

AI summary: Targets the limited effectiveness of over-the-air computation (AirComp) aggregation in wireless distributed learning. Proposes global unknown estimation (GUE), a statistical framework that treats model aggregation as an inference task and, at low SNR, reduces the required power by approximately 15 dB relative to AirComp.

Abstract

Over-the-air computation (AirComp) is widely used for model aggregation in wireless distributed learning. Although it enhances communication efficiency, we believe the AirComp aggregation has limited effectiveness due to the difference between its target problem and that of distributed learning. In this paper, we develop a rigorous formulation for optimal model aggregation in wireless distributed learning. Using this formulation, we show that AirComp aggregation generally assumes a mismatched statistical model for local parameters. We then propose a statistical framework for model aggregation, called global unknown estimation (GUE). It captures the statistical relation between the local and global model parameters, allowing model aggregation to be interpreted as an inference task. We validate the efficiency of GUE through numerical experiments. Our results show that, in the low SNR regime, GUE can reduce the required power for model aggregation by approximately 15 dB compared to AirComp aggregation. Remarkably, this gain is obtained without additional computational overhead.
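
A reduction of approximately 15 dB is a ratio on a logarithmic scale, and converting it to a linear factor shows how large the saving is. The helper name below is illustrative.

```python
def db_to_power_ratio(db):
    """Convert a power difference in decibels to a linear power ratio."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    print(db_to_power_ratio(15))  # linear factor for a 15 dB difference
```

Since 10^1.5 is about 31.6, a 15 dB reduction corresponds to roughly one thirty-second of the original transmit power.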

2606.02782 2026-06-03 eess.SP

Short-Acquisition Contrast-Free Super-Resolution Microvascular Imaging in Rabbit Kidney

Zhengchang Kou, Yuning Zhao, Mingrui Liu, Rita J. Miller, Michael L. Oelze

AI summary: Proposes a contrast-free super-resolution ultrasound microvascular imaging method based on high-frequency ultrafast ultrasound and nonlinear beamforming of backscatter signals from blood flow. Using only 125 ms of data per image, it achieves imaging at 8 frames/s with a spatial resolution of 22.2 µm, a three-fold improvement over conventional power Doppler.

Abstract

Ultrasound localization microscopy (ULM) enables micrometer-scale microvascular imaging by localizing and tracking intravascular microbubbles, but its dependence on exogenous contrast agents and long acquisition times limits clinical translation. This study presents a high-frame-rate contrast-free super-resolution ultrasound microvascular imaging method based on high-frequency ultrafast ultrasound and nonlinear beamforming of backscatter signals from native blood flow. Using only 125 milliseconds of in vivo ultrafast data per image, the proposed method achieved an imaging frame rate of 8 frames/s in a rabbit kidney model. The reconstructed microvascular images resolved vessels with a global spatial resolution of 22.2 µm over a field of view of 23.04 x 15.18 mm², where the wavelength of ultrasound was 67.5 µm. This corresponds to a three-fold improvement over conventional power Doppler imaging under the same acquisition duration. Compared with conventional flow imaging, the proposed method provided improved microvascular contrast and finer vessel delineation without microbubble injection. These results demonstrate a practical pathway toward high-frame-rate, contrast-free super-resolution ultrasound imaging for microvascular assessment.
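The headline numbers can be sanity-checked with two lines of arithmetic. The sketch below assumes that images are formed from back-to-back, non-overlapping 125 ms acquisitions, which the abstract does not state explicitly. The variable names are illustrative.

```python
acquisition_s = 0.125   # 125 ms of ultrafast data per image (from the abstract)
wavelength_um = 67.5    # ultrasound wavelength (from the abstract)
resolution_um = 22.2    # reported global spatial resolution (from the abstract)

# Assumption: one image per non-overlapping acquisition window.
frames_per_second = 1.0 / acquisition_s

# How many resolved lengths fit into one wavelength.
wavelength_to_resolution = wavelength_um / resolution_um

print(frames_per_second)
print(round(wavelength_to_resolution, 2))
```

Under that assumption the frame rate works out to the reported 8 frames/s, and the resolved scale is roughly one third of the wavelength.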

2606.02771 2026-06-03 eess.SP

A data-driven filter bank framework for IMU-based heave motion estimation

Aybars Tokta

AI summary: Proposes a data-driven framework that optimizes a bank of IIR filters, each tied to a specific frequency range, on a synthetic dataset to achieve accurate and robust IMU-based heave motion estimation.

Comments: 6 pages, 10 figures

Abstract

In this study, we address the IMU-based heave motion estimation problem for inertial navigation systems. Unlike existing approaches, we propose a data-driven framework in which a bank of IIR filters, each associated with a specific frequency range, is optimized using a synthetically generated dataset of realistic heave-acceleration tuples. The synthetic heave signal generation pipeline starts by synthesizing random wave signals from established wave energy spectra and then processing them through heave response amplitude operators reported in the literature. The corresponding vertical acceleration measurements are obtained by double-differentiating the heave signals and corrupting them with realistic low- and high-frequency disturbances observed in real IMU recordings. A Fourier-transform-based method is used to estimate the mean peak period and select the appropriate filter. Simulation results from both offline and real-time tests demonstrate that the proposed method is robust to varying sea regimes and provides accurate heave estimation, with a maximum RMSE not exceeding the larger of 5 cm or 5% of the significant heave height.
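Two steps in the abstract lend themselves to a small sketch: obtaining vertical acceleration by double-differentiating a heave signal, and the accuracy bound "the larger of 5 cm or 5% of the significant heave height". The code below is a minimal illustration under stated assumptions: a single sinusoid stands in for the synthesized wave signal, a central second difference stands in for the differentiation step, and the IMU disturbances are not modeled. All function names and wave parameters are hypothetical.

```python
import math


def second_difference(signal, dt):
    """Approximate the second time derivative with central differences."""
    return [
        (signal[i + 1] - 2.0 * signal[i] + signal[i - 1]) / dt**2
        for i in range(1, len(signal) - 1)
    ]


def rmse_limit(significant_heave_m):
    """Bound quoted in the abstract: the larger of 5 cm or 5% of the significant heave height."""
    return max(0.05, 0.05 * significant_heave_m)


# Hypothetical test wave: 1 m amplitude, 8 s period, sampled at 50 Hz.
amplitude_m, period_s, dt = 1.0, 8.0, 0.02
omega = 2.0 * math.pi / period_s
heave = [amplitude_m * math.sin(omega * k * dt) for k in range(1000)]
accel = second_difference(heave, dt)

# For a pure sinusoid the exact acceleration is -omega^2 * heave.
max_err = max(abs(a + omega**2 * h) for a, h in zip(accel, heave[1:-1]))
print(max_err)
print(rmse_limit(0.5), rmse_limit(2.0))
```

The bound switches from the absolute 5 cm floor to the relative 5% term once the significant heave height exceeds 1 m.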

2606.02661 2026-06-03 eess.IV cs.AI cs.LG

Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting

Yunlong Zhou, Chen Zhao, Danyang Peng, Fanfan Ji, Xiao-Tong Yuan

AI summary: Proposes the Spectral-Decoupled Iterative Refinement (SDIR) framework, which uses a dual-path design (SFG-Former and FR-Refiner) and a physically consistent power spectral density loss to perform progressive frequency-decoupled refinement for precipitation nowcasting within a deterministic framework. It eliminates blurring and hallucinations, outperforming existing methods in spatial accuracy while reaching spectral fidelity competitive with diffusion-based methods.

Comments: 21 pages, 10 figures, accepted at ICML 2026

Abstract

Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.
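To make the "spectrally decaying" failure mode concrete, the sketch below compares the power spectrum of a 1D signal with that of a smoothed copy and scores the gap with a generic log-spectrum penalty. This is not the paper's PCPSD loss: the 1D setting, the naive DFT, the moving-average blur, and every function name are assumptions chosen for illustration.

```python
import cmath
import math


def power_spectrum(x):
    """Naive DFT power spectrum |X_k|^2 / N for k = 0 .. N/2."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def log_psd_loss(pred, target, eps=1e-8):
    """Mean squared difference between log power spectra (generic spectral penalty)."""
    p, q = power_spectrum(pred), power_spectrum(target)
    return sum((math.log(a + eps) - math.log(b + eps)) ** 2 for a, b in zip(p, q)) / len(p)


n = 64
# Target with one low-frequency (k=2) and one high-frequency (k=20) component.
target = [
    math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.sin(2 * math.pi * 20 * t / n)
    for t in range(n)
]
# Circular 3-point moving average as a stand-in for an over-smoothed forecast.
blurred = [(target[t - 1] + target[t] + target[(t + 1) % n]) / 3.0 for t in range(n)]

ps_target = power_spectrum(target)
ps_blurred = power_spectrum(blurred)
print(ps_blurred[2] / ps_target[2], ps_blurred[20] / ps_target[20])
print(log_psd_loss(blurred, target))
```

The smoothing leaves the low-frequency bin almost untouched but removes most of the high-frequency power, which is the kind of discrepancy a spectral loss term is meant to penalize.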

2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

AI summary: To study speech-vision hallucination in audio-visual large language models, the paper proposes the SVHalluc benchmark, which evaluates along semantic and temporal dimensions how well models align speech content with visual signals, and finds that current open-source models are limited in cross-modal understanding.

Comments: Accepted at CVPR 2026

Abstract

Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02639 2026-06-03 eess.IV cs.AI cs.CV

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Spoorthi M, Suja Palaniswamy

AI summary: Identifies and resolves a failure caused by TensoRF's default density shift when applied to X-ray attenuation fields, and proposes AReT, an anatomy-regularized tensorial radiance field framework that achieves stable volumetric reconstruction of lung nodules from only three orthogonal X-ray projections, with high accuracy on the LIDC-IDRI dataset.

Abstract

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
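A small numeric sketch can show why a density shift of -10 suppresses gradients. It assumes the density is produced by a softplus activation applied to a raw feature plus the shift, which is an assumption about the activation and is not stated in the abstract. The helper names are hypothetical.

```python
import math


def softplus(x):
    """softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def softplus_grad(x):
    """Derivative of softplus, which is the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumed density model: density = softplus(raw_feature + density_shift).
raw_feature = 0.0  # illustrative starting value for an untrained field
for density_shift in (-10.0, 0.0):
    density = softplus(raw_feature + density_shift)
    grad = softplus_grad(raw_feature + density_shift)
    print(density_shift, density, grad)
```

With the shift at -10 the gradient with respect to the raw feature is on the order of 1e-5, while with the shift at zero it is 0.5, which is consistent with the abstract's observation that zeroing the shift restores gradient flow.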

2606.02634 2026-06-03 eess.IV cs.AI

Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance

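Elias Stenhede, Edvart Grüner Bjerke, Joanna Sulkowska, Eivind Bjørkan Orstad, Ole Jakob Elle, Ulysse Côté-Allard, Arian Ranjbar

AI summary: Proposes Echo-POSED, a self-supervised framework trained on 2D views sliced from 3D echocardiography volumes, for real-time transthoracic echocardiography guidance without expert-annotated views or tracked probe trajectories. It maintains equivariance to probe motion on SO(3)×SO(3) and reaches a mean angular error of 8.2 degrees in intra-patient and inter-patient guidance simulations.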

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

AI summary: Studies whether audio, images, and video can share a unified wavelet token schema. Using a continuous-token model built on a Haar DWT/IDWT frontend, it validates the feasibility of a unified token schema on several datasets and analyzes the effects of latent capacity and metadata.

Comments: 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
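The frontend and the energy-based baseline can be sketched in a few lines. The code below shows a one-level orthonormal Haar DWT/IDWT on a 1D sequence and a simple "keep the largest-magnitude coefficients" selection. It is an illustration only: the 1D setting, the function names, and the selection rule are assumptions and may differ from the paper's token layout and its energy_global procedure.

```python
import math


def haar_dwt(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def keep_top_energy(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
approx, detail = haar_dwt(signal)
reconstructed = haar_idwt(approx, detail)

# Keep half of all coefficients, chosen by magnitude across both bands.
sparse = keep_top_energy(approx + detail, keep=4)
lossy = haar_idwt(sparse[: len(approx)], sparse[len(approx):])
print(reconstructed)
print(lossy)
```

Because the transform is orthonormal, dropping the smallest coefficients removes the least signal energy for a given keep ratio, which is one reason magnitude-based selection tends to be a strong baseline against uniform selection.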

2606.02615 2026-06-03 eess.AS cs.AI cs.SD

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

Haolong Zheng, Siyin Wang, Xulin Fan, Zengrui Jin, Mark Hasegawa-Johnson

AI summary: Proposes FSA-GRPO, a reinforcement-learning-based post-training method whose specially designed reward encourages the model to use few-shot demonstrations, strengthening few-shot adaptation and yielding gains in children's speech recognition, speech translation, and audio understanding.

Abstract

Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS

Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik

AI summary: For lower-resource languages and niche domains, proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles the synthesized utterances. Experiments show that synthetic conversations improve ASR performance. On a Hungarian benchmark, a configuration with only 67 hours of real conversations and 636 hours of simulated data outperforms a zero-shot model trained on 2700 hours.

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03931 2026-06-03 cs.RO cs.SY eess.SY

Multi-Robot Bearing-only Pose Estimation via Angle Rigidity

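J. Francisco Presenza, Leonardo J. Colombo, Ignacio Mas, Juan I. Giribet

AI summary: Proposes a distributed bearing-only pose estimator that uses body-frame bearings to compute positions and recover the pose. It requires only an angle rigidity condition and achieves local uniform exponential stability.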

2606.03887 2026-06-03 eess.SY cs.SY

A Dynamic Capacity Allocation Model for DERs under Non-Firm Connection Agreements

Neda Vahabzad, Kenneth Bruninx, Peter Palensky, Pedro P. Vergara

AI summary: Proposes a bilevel optimization model that dynamically allocates connection capacity to distributed energy resources (DERs) under non-firm connection agreements, balancing the objectives of the distribution system operator and DER owners, trading off grid utilization against economic efficiency, and substantially reducing curtailment costs.

Abstract

The growing penetration of distributed energy resources (DERs) intensifies congestion in distribution networks by introducing bidirectional power flows and increasing competition for limited network capacity, underscoring the need for effective and efficient congestion management, including flexible grid-access schemes. This paper proposes a bilevel optimization model for the dynamic allocation of connection capacity to DERs under non-firm connection agreements, aligning the objectives of the distribution system operator (DSO) and DER owners. The upper-level problem, representing the DSO, determines the allocated connection capacity for all DERs, defined as maximum time-varying power limits, subject to distribution system constraints and the last-in-first-out (LIFO) allocation rule. The lower-level problem, representing DER owners, maximizes the profit of each DER within the allocated power limits. The proposed model is tested on a modified CIGRE medium-voltage (MV) network, demonstrating a balanced trade-off between grid utilization and economic efficiency. Furthermore, the model enhances DER integration, enforces transparent allocation rules, reduces variability in allocation patterns, and achieves up to an 80% reduction in total curtailment costs compared with benchmark methods.
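The last-in-first-out rule on its own can be sketched as a greedy allocation in which the most recently connected DERs are curtailed first. This is a deliberately simplified, single-time-step illustration with one aggregate capacity limit. It is not the paper's bilevel model, which also handles network constraints and DER profit maximization. The function name, DER names, and numbers are hypothetical.

```python
def lifo_allocate(requests_kw, available_kw):
    """Grant capacity in connection order, so the latest connections are curtailed first.

    requests_kw: list of (name, requested_kw), ordered from earliest to latest connection.
    available_kw: total capacity available in this time step.
    """
    allocation = {}
    remaining = available_kw
    for name, requested in requests_kw:
        granted = min(requested, max(remaining, 0.0))
        allocation[name] = granted
        remaining -= granted
    return allocation


# Hypothetical DERs, listed in the order they connected to the grid.
requests = [("pv_2019", 40.0), ("wind_2021", 50.0), ("pv_2024", 30.0)]
allocation = lifo_allocate(requests, available_kw=100.0)
print(allocation)
```

In this example only the most recent connection is curtailed, and the earlier ones receive their full request.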

2606.03872 2026-06-03 eess.SY cs.SY

NeuroSymbolic Robustness Analysis for Discrete Systems with Respect to Transition Deviations

Shih-Jie Shih, Jonghan Lim, Ilya Kovalenko, Rômulo Meira-Góes

AI summary: Proposes a neurosymbolic computing framework in which a large language model infers a set of feasible deviation transitions and a symbolic layer then computes discrete robustness guarantees, addressing the poor scalability and conservatism of conventional approaches.

Abstract

Supervisory control of discrete-event systems provides formal guarantees of correctness with respect to a plant model and specification. However, these guarantees heavily rely on the plant model, which could deviate from nominal behavior due to modeling errors or faults. Recent notions of discrete robustness model deviations as a set of additional transitions that are added to the plant. The discrete robustness is defined as all sets of extra transitions for which the supervised system still guarantees a desired specification. However, this notion suffers from poor scalability, due to the large solution space, and from conservatism, since most deviations are infeasible in practice. This paper proposes to address these two issues using a neurosymbolic computing framework for discrete robustness analysis of safety properties. First, a neural reasoning layer based on Large Language Models infers a set of feasible deviation transitions from system models, specifications, and domain knowledge. Next, a symbolic layer computes the discrete robustness guarantees over the inferred deviation set. We evaluate our framework on three case studies, demonstrating that our method identifies a smaller set of feasible deviations while preserving robustness guarantees comparable to those of full transition-based analysis.
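The idea of checking a safety property under a set of extra transitions can be sketched with plain reachability. The example below treats the system as a finite transition set, adds a candidate deviation set, and tests whether any unsafe state becomes reachable. It is a simplified stand-in that ignores the supervisor, controllability, and the LLM-based inference step. The automaton, state names, and function names are hypothetical.

```python
from collections import deque


def reachable(transitions, initial):
    """Breadth-first search over (source, event, target) transitions."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        for src, _event, dst in transitions:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def is_robust(nominal, deviation, initial, unsafe):
    """True if no unsafe state is reachable once the deviation transitions are added."""
    return reachable(nominal | deviation, initial).isdisjoint(unsafe)


nominal = {("idle", "start", "busy"), ("busy", "done", "idle")}
unsafe = {"crash"}

benign = {("busy", "retry", "busy")}
harmful = {("busy", "fault", "crash")}
print(is_robust(nominal, benign, "idle", unsafe))
print(is_robust(nominal, harmful, "idle", unsafe))
```

Enumerating every possible deviation set grows exponentially with the number of candidate transitions, which gives a feel for why restricting the analysis to a smaller set of feasible deviations helps.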

2606.03862 2026-06-03 eess.SY cs.CC cs.SY math.OC

APX-Hardness of Computing Lipschitz Constants for Multi-Parametric Quadratic Programs

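Xingchen Li, Kunpeng Liu, Keyou You

AI summary: Proves that computing the Lipschitz constant of the solution map of a multi-parametric quadratic program is not only NP-hard but also APX-hard, shows that the problem is solvable in polynomial time when the number of constraints or decision variables is fixed, and shows that NP-hardness and APX-hardness persist even in the scalar-parameter case.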

2606.03794 2026-06-03 cs.LG eess.SP

Limit Analysis of Graph Neural Networks with Wireless Conflict Graphs

Romina Garcia Camargo, Zhiyang Wang, Alejandro Ribeiro

AI summary: For graph neural networks on sparse random geometric graphs, analyzes their closeness to deterministic grid graphs to establish theoretical bounds on transferability across scales, and validates the advantage of the learned policies on a link scheduling problem.

Abstract

Graph Neural Networks (GNNs) have emerged as a powerful tool for wireless resource allocation that leverages the underlying graph structure of communication networks. Their transferability property enables models trained on small-scale graphs to generalize to large-scale deployments with little performance deterioration, a desirable property for currently growing networks. Wireless networks operate in a sparse regime, where each node is connected to only a small number of other users. This work establishes theoretical results for transferability of GNNs over graphs derived from sparse Random Geometric Graphs (RGGs). In particular, we focus on conflict graphs of RGGs used to model interference among links. Our approach considers the closeness between RGGs and Deterministic Grid Graphs (DGGs) to establish bounds on the performance loss when a model is transferred across scales. We validate our theoretical findings through the problem of link scheduling, demonstrating that our learned policies consistently outperform existing benchmarks at scale. Finally, we examine the impact of our theoretical assumptions on empirical performance.
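The two graph objects in the abstract can be sketched as follows: a random geometric graph built from uniformly placed nodes, and a conflict graph whose vertices are the links of that graph. The conflict rule used here, two links conflict when they share an endpoint, is a simple assumed interference model and may differ from the one used in the paper. Function names and parameters are hypothetical.

```python
import math
import random


def random_geometric_graph(n, radius, seed=0):
    """Place n nodes uniformly in the unit square and link pairs within `radius`."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    links = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= radius
    ]
    return points, links


def conflict_graph(links):
    """Edges between link indices whose links share an endpoint (assumed conflict rule)."""
    return [
        (a, b)
        for a in range(len(links))
        for b in range(a + 1, len(links))
        if set(links[a]) & set(links[b])
    ]


points, links = random_geometric_graph(30, radius=0.2, seed=1)
conflicts = conflict_graph(links)
print(len(links), len(conflicts))
```

A link scheduling policy would then pick a set of links with no conflict edge between them, which is an independent set in the conflict graph.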
